[Google_Bootcamp_Day24]
Language model and sequence generation
P(y<1>, y<2>, y<3>, …, y<Ty>)
- A language model estimates the probability of a particular sequence of words
- In a language model, sentences are represented as outputs y rather than inputs x
- Training set : large corpus of text
Process of language modeling
- Tokenize
- Map each token to its entry in the vocabulary (one-hot encoding)
- If a token is not in the vocabulary, replace it with UNK (unknown word)
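The steps above can be sketched in a few lines of Python. This is only an illustration: the helper names (`tokenize`, `build_vocab`, `encode`, `one_hot`), the regular expression, and the choice to append an `<EOS>` token at tokenization time are assumptions, not something these notes prescribe.

```python
import re

UNK, EOS = "<UNK>", "<EOS>"

def tokenize(sentence):
    # Lowercase, keep word-like tokens, drop punctuation, then append <EOS>.
    return re.findall(r"[a-z0-9']+", sentence.lower()) + [EOS]

def build_vocab(corpus):
    # Hypothetical helper: give every token seen in the corpus an integer id.
    vocab = {UNK: 0, EOS: 1}
    for sentence in corpus:
        for tok in tokenize(sentence):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(sentence, vocab):
    # Tokens that are not in the vocabulary fall back to the <UNK> id.
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(sentence)]

def one_hot(index, size):
    # One-hot vector: all zeros except a 1 at the token's id.
    vec = [0] * size
    vec[index] = 1
    return vec

vocab = build_vocab(["Cats average 15 hours of sleep a day."])
ids = encode("Cats sleep 15 hours a night.", vocab)   # "night" becomes <UNK>
x1 = one_hot(ids[0], len(vocab))
```

In practice the vocabulary is usually capped at the most frequent words of a large corpus, which is what makes the UNK fallback necessary.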
RNN model
- Assume the example sentence is “Cats average 15 hours of sleep a day.”
- a<1> makes a softmax prediction that estimates the probability of each vocabulary word being the first word (y_hat<1>)
- y_hat<2> = P( _______ | “cats”)
- y_hat<3> = P( _______ | “cats average”)
- P(y<1>, y<2>, y<3>) = P(y<1>) * P(y<2> | y<1>) * P(y<3> | y<1>, y<2>)
Loss function
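A minimal sketch, assuming the standard softmax cross-entropy loss summed over time-steps. The function name `sequence_loss` and the toy numbers are made up for illustration. With one-hot targets, the summed loss is the negative log of the chain-rule product P(y<1>) * P(y<2> | y<1>) * …

```python
import numpy as np

def sequence_loss(y_hat, targets):
    """Sum of per-step cross-entropy losses.

    y_hat:   (T, V) array, row t is the softmax distribution predicted at step t
    targets: length-T list of true word ids
    """
    steps = np.arange(len(targets))
    # With one-hot targets, the loss at step t reduces to -log y_hat[t, target_t].
    return -np.sum(np.log(y_hat[steps, targets]))

# Toy predictions over a 3-word vocabulary for a 2-word sentence (made-up numbers).
y_hat = np.array([[0.5, 0.25, 0.25],
                  [0.1, 0.8, 0.1]])
targets = [0, 1]
loss = sequence_loss(y_hat, targets)
# Chain rule: P(sentence) = 0.5 * 0.8 = 0.4, so loss equals -log(0.4).
```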
Sampling a sequence from a trained RNN language model
- sample from the trained model’s distribution to generate novel sequences of words.
- at each time-step, randomly sample a word according to the softmax distribution and feed it in as the next input
- keep sampling until you generate an EOS token.
- Also could be character-level RNN as well as word-level RNN
- pros : don’t ever have to worry about unknown word tokens
- cons : sequences end up much longer, so character-level models are not as good as word-level language models at capturing long-range dependencies, i.e. how the earlier parts of a sentence affect the later parts
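The sampling loop can be sketched as follows, assuming a basic tanh RNN cell with a softmax output. The parameter names (`Waa`, `Wax`, `Wya`, `ba`, `by`), the `max_len` safety cap, and the random weights standing in for a trained model are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_sequence(params, eos_id, max_len=50, seed=0):
    """Sample token ids one at a time until <EOS> (or max_len) is reached."""
    Waa, Wax, Wya, ba, by = params
    rng = np.random.default_rng(seed)
    vocab_size, hidden = Wya.shape
    a = np.zeros(hidden)            # a<0> = 0
    x = np.zeros(vocab_size)        # x<1> = 0
    ids = []
    while len(ids) < max_len:
        a = np.tanh(Waa @ a + Wax @ x + ba)
        probs = softmax(Wya @ a + by)
        idx = int(rng.choice(vocab_size, p=probs))  # random draw, not argmax
        ids.append(idx)
        if idx == eos_id:
            break
        x = np.zeros(vocab_size)    # feed the sampled token back in as x<t+1>
        x[idx] = 1.0
    return ids

# Random, untrained weights stand in for a trained model here.
rng = np.random.default_rng(1)
V, H = 6, 4
params = (rng.normal(size=(H, H)), rng.normal(size=(H, V)),
          rng.normal(size=(V, H)), np.zeros(H), np.zeros(V))
sample = sample_sequence(params, eos_id=0, max_len=20)
```

The same loop works for a character-level model; only the vocabulary changes from words to characters.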
Vanishing gradients problem with RNNs
- basic RNNs are not very good at capturing very long-term dependencies, because gradients shrink as they are propagated back through many time-steps (similar to a very deep NN)
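A toy illustration of the effect, assuming a scalar RNN a<t> = tanh(w * a<t-1>) with no inputs. The function name and the values of `w` and `a0` are arbitrary. The gradient of the last activation with respect to the first is a product of one factor per time-step, and each factor here is below 1.

```python
import math

def gradient_through_time(w, a0, steps):
    """d a<steps> / d a<0> for the scalar RNN a<t> = tanh(w * a<t-1>)."""
    a, grad = a0, 1.0
    for _ in range(steps):
        a = math.tanh(w * a)
        grad *= w * (1.0 - a * a)   # chain rule: one factor per time-step
    return grad

short_range = gradient_through_time(w=0.9, a0=0.5, steps=5)
long_range = gradient_through_time(w=0.9, a0=0.5, steps=50)
# Every factor is at most 0.9, so long_range is bounded by 0.9 ** 50.
```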
Gated Recurrent Unit (GRU)
RNN unit
GRU unit
- element-wise multiplication tells your GRU which are the dimensions of your memory cell vector to update at every time-step, so you can choose to keep some bits constant while updating other bits.
Full GRU
- gate gamma r tells you how relevant c<t-1> is to computing the next candidate for c<t>
- GRUs have longer-range connections, and also address the vanishing gradient problem
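One time-step of the full GRU can be sketched as below, assuming the usual formulation with an update gate and a relevance gate. The dictionary keys (`Wu`, `Wr`, `Wc`, `bu`, `br`, `bc`) and the dimensions are illustrative. The demo forces the update gate shut with a large negative bias to show that the memory cell is then carried over essentially unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x, p):
    """One full-GRU time-step; p is a dict of weights (names are illustrative)."""
    cx = np.concatenate([c_prev, x])
    gamma_u = sigmoid(p["Wu"] @ cx + p["bu"])      # update gate
    gamma_r = sigmoid(p["Wr"] @ cx + p["br"])      # relevance gate
    c_tilde = np.tanh(p["Wc"] @ np.concatenate([gamma_r * c_prev, x]) + p["bc"])
    # Element-wise blend: dimensions with gamma_u near 0 keep their old value.
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

rng = np.random.default_rng(0)
H, D = 3, 2
p = {k: rng.normal(size=(H, H + D)) for k in ("Wu", "Wr", "Wc")}
p.update({k: np.zeros(H) for k in ("bu", "br", "bc")})
p["bu"] = np.full(H, -30.0)          # force the update gate shut for this demo
c_prev = np.array([0.5, -0.2, 0.1])
c_next = gru_step(c_prev, rng.normal(size=D), p)   # approximately equal to c_prev
```

Because c<t> can stay close to c<t-1> across many steps, information and gradients are able to travel over long ranges.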
Long short-term memory unit (LSTM)
- slightly more powerful and more general version of the GRU
- separate forget gate and update gate (the GRU uses a single update gate for both roles)
Connect several LSTM units
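A minimal sketch of one LSTM step and of chaining several steps, assuming the common formulation with update, forget, and output gates (no peephole connections). Weight names and sizes are illustrative, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x, p):
    """One LSTM time-step; returns the new activation a<t> and memory cell c<t>."""
    ax = np.concatenate([a_prev, x])
    gamma_u = sigmoid(p["Wu"] @ ax + p["bu"])   # update gate
    gamma_f = sigmoid(p["Wf"] @ ax + p["bf"])   # forget gate, separate from update
    gamma_o = sigmoid(p["Wo"] @ ax + p["bo"])   # output gate
    c_tilde = np.tanh(p["Wc"] @ ax + p["bc"])   # candidate memory
    c = gamma_u * c_tilde + gamma_f * c_prev
    a = gamma_o * np.tanh(c)
    return a, c

rng = np.random.default_rng(0)
H, D = 3, 2
p = {w: rng.normal(size=(H, H + D)) for w in ("Wu", "Wf", "Wo", "Wc")}
p.update({b: np.zeros(H) for b in ("bu", "bf", "bo", "bc")})

# Connecting several units: the (a, c) pair from one step feeds the next.
a, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(4, D)):
    a, c = lstm_step(a, c, x, p)
```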
Bidirectional RNN
Problem of unidirectional RNN
- EX1. He said, “Teddy bears are on sale!”
- EX2. He said, “Teddy Roosevelt was a great President!”
- not enough to just look at the first part of the sentence to figure out whether the third word “Teddy” is a part of the person’s name
Solution = Bidirectional RNN
- ex. in the case of “y_hat<3>”, information from x<1>, x<2>, x<3> arrives through the forward recurrent components (blue) and information from x<4> arrives through the backward recurrent components (red)
- a bidirectional RNN with LSTM units is commonly used
- cons : need the entire sequence of data before you can make predictions anywhere
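The idea can be sketched with two basic tanh RNN passes, one in each direction, whose activations are concatenated at every time-step. The names (`rnn_pass`, `birnn`, `Wy`, `by`) and sizes are assumptions, and plain tanh cells are used instead of LSTM units to keep the example short. The demo perturbs x<4> and recomputes the output at t = 3.

```python
import numpy as np

def rnn_pass(xs, Waa, Wax, ba):
    """Run a basic tanh RNN over xs in the given order; one activation per step."""
    a = np.zeros(Waa.shape[0])
    out = []
    for x in xs:
        a = np.tanh(Waa @ a + Wax @ x + ba)
        out.append(a)
    return out

def birnn(xs, fwd, bwd, Wy, by):
    a_fwd = rnn_pass(xs, *fwd)                 # left to right
    a_bwd = rnn_pass(xs[::-1], *bwd)[::-1]     # right to left, re-aligned to time order
    # The output at time t uses both the forward and the backward activation.
    return [Wy @ np.concatenate([f, b]) + by for f, b in zip(a_fwd, a_bwd)]

rng = np.random.default_rng(0)
H, D, K, T = 3, 2, 2, 5
make = lambda: (rng.normal(size=(H, H)), rng.normal(size=(H, D)), np.zeros(H))
fwd, bwd = make(), make()
Wy, by = rng.normal(size=(K, 2 * H)), np.zeros(K)

xs = rng.normal(size=(T, D))
y = birnn(xs, fwd, bwd, Wy, by)
xs_changed = xs.copy()
xs_changed[3] += 1.0                           # perturb x<4> only
y_changed = birnn(xs_changed, fwd, bwd, Wy, by)
```

The forward activation at t = 3 is identical in both runs, so any change in the output at t = 3 comes through the backward pass. The backward pass starts from the last input, which is why the whole sequence must be available first.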
Deep RNNs
- deep RNNs are quite computationally expensive to train
- Because of the temporal dimension, these networks can already get quite big even with just a handful of layers (3 layers is already quite big)
- Alternatively, you can stack further deep layers on top of the third recurrent layer that are not connected horizontally (across time)
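A minimal sketch of a stacked RNN, assuming basic tanh cells where layer l at time t combines its own previous activation (the horizontal connection) with the activation of the layer below at the same time-step (the vertical connection). The function name, layer sizes, and random weights are illustrative.

```python
import numpy as np

def deep_rnn(xs, layers):
    """Stacked tanh RNN; returns the top layer's activation at every time-step."""
    states = [np.zeros(W.shape[0]) for W, _ in layers]
    top = []
    for x in xs:                               # time dimension
        inp = x
        for l, (W, b) in enumerate(layers):    # depth dimension
            states[l] = np.tanh(W @ np.concatenate([states[l], inp]) + b)
            inp = states[l]                    # feeds the layer above
        top.append(inp)
    return top

rng = np.random.default_rng(0)
D, H, T = 2, 4, 6
input_sizes = [D, H, H]                        # three recurrent layers
layers = [(rng.normal(size=(H, H + n_in)), np.zeros(H)) for n_in in input_sizes]
top = deep_rnn(rng.normal(size=(T, D)), layers)
```

The nested loops make the cost visible: every extra recurrent layer adds one cell evaluation per time-step, and the time loop cannot be parallelized across steps.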
[Source] https://www.coursera.org/learn/nlp-sequence-models