[Google_Bootcamp_Day24]

Language model and sequence generation

P(y<1>, y<2>, y<3>, …, y<Ty>) = ?

  • A language model estimates the probability of a particular sequence of words
  • For language modeling, it is useful to represent sentences as outputs y rather than inputs x
  • Training set : a large corpus of text

Process of language modeling

  1. Tokenize
  2. Map each token to an index in the vocabulary (one-hot encoding)
  3. If a token is not in the vocabulary, replace it with the UNK (unknown word) token
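A minimal sketch of these three steps in Python; the toy vocabulary and whitespace tokenizer below are illustrative assumptions, not from the course:

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would have ~10,000+ words.
vocab = ["<EOS>", "<UNK>", "a", "average", "cats", "day", "hours", "of", "sleep", "15"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def tokenize(sentence):
    # 1. Tokenize (naive whitespace split; real corpora need a proper tokenizer)
    return sentence.lower().replace(".", " <EOS>").split()

def to_one_hot(tokens, word_to_idx):
    # 2.-3. Map each token to its vocabulary index, falling back to <UNK>,
    # then one-hot encode it as a |V|-dimensional vector.
    idxs = [word_to_idx.get(t, word_to_idx["<UNK>"]) for t in tokens]
    one_hot = np.zeros((len(idxs), len(word_to_idx)))
    one_hot[np.arange(len(idxs)), idxs] = 1.0
    return one_hot

x = to_one_hot(tokenize("Cats average 15 hours of sleep a day."), word_to_idx)
print(x.shape)  # (num_tokens, vocab_size)
```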

RNN model

(Figure: rnn)

  • Assume the example sentence is "Cats average 15 hours of sleep a day."
  • a<1> makes a softmax prediction y_hat<1> to estimate the probability of the first word
  • y_hat<2> = P( ___ | "cats")
  • y_hat<3> = P( ___ | "cats average")
  • P(y<1>, y<2>, y<3>) = P(y<1>) * P(y<2> | y<1>) * P(y<3> | y<1>, y<2>)
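A small sketch of this chain-rule factorization, assuming the per-step softmax outputs y_hat<t> are already available as arrays (the names and dummy values below are hypothetical):

```python
import numpy as np

def sequence_probability(softmax_outputs, target_idxs):
    # P(y<1>, ..., y<Ty>) = prod over t of P(y<t> | y<1>, ..., y<t-1>),
    # where softmax_outputs[t] is the model's distribution at time-step t.
    prob = 1.0
    for y_hat_t, idx in zip(softmax_outputs, target_idxs):
        prob *= y_hat_t[idx]
    return prob

# Example with a 3-step sequence over a 10-word vocabulary
softmax_outputs = [np.full(10, 0.1)] * 3   # dummy uniform distributions
print(sequence_probability(softmax_outputs, [4, 3, 9]))  # 0.1^3 ~ 0.001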

Loss function

(Figure: loss)
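For reference, the loss in the figure is the softmax cross-entropy at each time-step, summed over the whole sequence:

```latex
\mathcal{L}^{<t>}\!\left(\hat{y}^{<t>}, y^{<t>}\right) = -\sum_{i} y_i^{<t>} \log \hat{y}_i^{<t>}
\qquad
\mathcal{L} = \sum_{t} \mathcal{L}^{<t>}\!\left(\hat{y}^{<t>}, y^{<t>}\right)
```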

Sampling a sequence from a trained RNN language model

(Figure: sequence_model)

  • sample from the trained model’s distribution to generate novel sequences of words
  • randomly sample each word according to the softmax distribution at that time-step
  • keep sampling until you generate an EOS token (see the sketch after this list)
  • Could also be a character-level RNN instead of a word-level RNN
    • pros : you never have to worry about unknown word tokens
    • cons : you end up with much longer sequences, so character-level models are not as good as word-level language models at capturing long-range dependencies between the earlier and later parts of the sentence
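A minimal sketch of the word-level sampling loop, assuming a trained step function that returns the softmax distribution at each time-step (the function name and signature are hypothetical):

```python
import numpy as np

def sample_sequence(rnn_step, vocab, eos_idx, hidden_size=64, max_len=50):
    # rnn_step(a_prev, x) -> (y_hat, a) is assumed: one forward step of a
    # trained RNN language model returning a softmax distribution y_hat.
    a = np.zeros(hidden_size)        # a<0> = zero vector
    x = np.zeros(len(vocab))         # x<1> = zero vector
    words = []
    for _ in range(max_len):
        y_hat, a = rnn_step(a, x)                    # distribution over the next word
        idx = np.random.choice(len(vocab), p=y_hat)  # randomly sample from the softmax
        if idx == eos_idx:                           # stop once EOS is generated
            break
        words.append(vocab[idx])
        x = np.zeros(len(vocab))                     # feed the sampled word back in
        x[idx] = 1.0
    return " ".join(words)
```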

Vanishing gradients problem with RNNs

  • basic RNNs are not very good at capturing very long-term dependencies (similar to very deep NNs)

Gated Recurrent Unit (GRU)

RNN unit (Figure: rnn)

GRU unit (Figures: gru, gru2)

  • the element-wise multiplication by the gate tells your GRU which dimensions of the memory cell vector to update at every time-step, so you can choose to keep some bits constant while updating other bits (as sketched below)
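A tiny numpy illustration of this per-dimension gated update (the vectors are made-up examples):

```python
import numpy as np

c_prev  = np.array([0.9, -0.4, 0.2])    # memory cell c<t-1>
c_tilde = np.array([0.1,  0.8, -0.5])   # candidate value c~<t>
gamma_u = np.array([1.0,  0.0, 0.5])    # update gate, one value per dimension

# Element-wise: a gate value near 1 overwrites that dimension with the
# candidate, near 0 keeps the old memory, in between blends the two.
c_next = gamma_u * c_tilde + (1 - gamma_u) * c_prev
print(c_next)   # [ 0.1  -0.4  -0.15]: dim 0 replaced, dim 1 kept, dim 2 blended
```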

Full GRU (Figure: full_gru)

  • the relevance gate gamma_r tells you how relevant c<t-1> is to computing the next candidate c~<t>
  • GRUs maintain longer-range connections and help address the vanishing gradient problem
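Written out, the full GRU equations (in the course notation) are:

```latex
\tilde{c}^{<t>} = \tanh\left(W_c \left[\Gamma_r \odot c^{<t-1>},\, x^{<t>}\right] + b_c\right) \\
\Gamma_u = \sigma\left(W_u \left[c^{<t-1>},\, x^{<t>}\right] + b_u\right) \\
\Gamma_r = \sigma\left(W_r \left[c^{<t-1>},\, x^{<t>}\right] + b_r\right) \\
c^{<t>} = \Gamma_u \odot \tilde{c}^{<t>} + (1 - \Gamma_u) \odot c^{<t-1>} \\
a^{<t>} = c^{<t>}
```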

Long short-term memory unit (LSTM)

(Figures: lstm_fig, lstm_formula)

  • slightly more powerful and more general version of the GRU
  • separate forget and update gates (plus an output gate)
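For reference, the LSTM equations from the formula figure (course notation) are:

```latex
\tilde{c}^{<t>} = \tanh\left(W_c \left[a^{<t-1>},\, x^{<t>}\right] + b_c\right) \\
\Gamma_u = \sigma\left(W_u \left[a^{<t-1>},\, x^{<t>}\right] + b_u\right) \\
\Gamma_f = \sigma\left(W_f \left[a^{<t-1>},\, x^{<t>}\right] + b_f\right) \\
\Gamma_o = \sigma\left(W_o \left[a^{<t-1>},\, x^{<t>}\right] + b_o\right) \\
c^{<t>} = \Gamma_u \odot \tilde{c}^{<t>} + \Gamma_f \odot c^{<t-1>} \\
a^{<t>} = \Gamma_o \odot \tanh\left(c^{<t>}\right)
```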

Connect several LSTM units (Figure: lstm_connect)
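A short sketch of chaining LSTM units over a sequence, using PyTorch's built-in nn.LSTM rather than the course's from-scratch implementation (the tensor shapes are illustrative):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=64, batch_first=True)

x = torch.randn(1, 8, 10)          # (batch, time-steps Tx=8, input vector size)
outputs, (h_n, c_n) = lstm(x)      # each time-step's a<t>, c<t> feed the next unit
print(outputs.shape)               # torch.Size([1, 8, 64]): a<1> ... a<8>
print(h_n.shape, c_n.shape)        # final hidden state a<Tx> and cell state c<Tx>
```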

Bidirectional RNN

Problem of unidirectional RNN

  • EX1. He said, “Teddy bears are on sale!”
  • EX2. He said, “Teddy Roosevelt was a great President!”
  • it is not enough to look only at the first part of the sentence to figure out whether the third word “Teddy” is part of a person’s name

Solution = Bidirectional RNN (Figure: bidrectional_rnn)

  • ex. in the case of y_hat<3>, inputs x<1>, x<2>, x<3> come in through the forward recurrent components (blue) and x<4> comes in through the backward recurrent components (red)
  • a bidirectional RNN with LSTM units appears to be commonly used
  • cons : you need the entire sequence of data before you can make predictions anywhere
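A sketch of a bidirectional LSTM layer in PyTorch (sizes are illustrative); note that the whole sequence must be available up front, which is the downside mentioned above:

```python
import torch
import torch.nn as nn

birnn = nn.LSTM(input_size=10, hidden_size=64, batch_first=True, bidirectional=True)

x = torch.randn(1, 4, 10)      # e.g. x<1> ... x<4>
outputs, _ = birnn(x)
print(outputs.shape)           # torch.Size([1, 4, 128]): forward and backward
                               # activations are concatenated at every time-step,
                               # so y_hat<3> can use x<1..3> (forward) and x<4> (backward)
```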

Deep RNNs

(Figure: deep_rnn)

  • deep RNNs are quite computationally expensive to train
  • Because of the temporal dimension, these networks can already get quite big even with just a small handful of layers (3 recurrent layers is already quite big)
  • Another variant adds a stack of deep layers after the 3rd recurrent layer that are not connected horizontally (i.e. ordinary feed-forward layers applied at each time-step)
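A sketch of a 3-layer stacked recurrent network in PyTorch (sizes are illustrative):

```python
import torch
import torch.nn as nn

deep_rnn = nn.LSTM(input_size=10, hidden_size=64, num_layers=3, batch_first=True)

x = torch.randn(1, 8, 10)
outputs, (h_n, c_n) = deep_rnn(x)
print(outputs.shape)   # torch.Size([1, 8, 64]): activations of the top (3rd) layer
print(h_n.shape)       # torch.Size([3, 1, 64]): final hidden state of each layer

# The "deep layers not connected horizontally" variant would apply ordinary
# feed-forward layers (e.g. nn.Linear) on top of `outputs` at each time-step.
```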

[Source] https://www.coursera.org/learn/nlp-sequence-models
