[Google_Bootcamp_Day24]

Language model and sequence generation

P(y<1>, y<2>, y<3>, …, y<Ty>) = ?

  • A language model estimates the probability of a particular sequence of words
  • For language modeling, it is useful to represent sentences as outputs y rather than inputs x
  • Training set : a large corpus of text

Process of language modeling

  1. Tokenize
  2. Map each token to an index in the vocabulary (one-hot encoding)
  3. If a token is not in the vocabulary, replace it with the UNK (unknown word) token
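A minimal sketch of these three steps in Python; the toy vocabulary and whitespace tokenizer below are illustrative assumptions, not from the course:

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would have ~10,000+ words.
vocab = ["<EOS>", "<UNK>", "a", "average", "cats", "day", "hours", "of", "sleep", "15"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def tokenize(sentence):
    # 1. Tokenize (naive whitespace split; real corpora need a proper tokenizer)
    return sentence.lower().replace(".", " <EOS>").split()

def to_one_hot(tokens, word_to_idx):
    # 2.-3. Map each token to its vocabulary index, falling back to <UNK>,
    # then one-hot encode it as a |V|-dimensional vector.
    idxs = [word_to_idx.get(t, word_to_idx["<UNK>"]) for t in tokens]
    one_hot = np.zeros((len(idxs), len(word_to_idx)))
    one_hot[np.arange(len(idxs)), idxs] = 1.0
    return one_hot

x = to_one_hot(tokenize("Cats average 15 hours of sleep a day."), word_to_idx)
print(x.shape)  # (num_tokens, vocab_size)
```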

RNN model

(Figure: rnn)

  • Assume the example sentence is "Cats average 15 hours of sleep a day."
  • a<1> makes a softmax prediction y_hat<1> to estimate the probability of the first word
  • y_hat<2> = P( ___ | "cats")
  • y_hat<3> = P( ___ | "cats average")
  • P(y<1>, y<2>, y<3>) = P(y<1>) * P(y<2> | y<1>) * P(y<3> | y<1>, y<2>)
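A small sketch of this chain-rule factorization, assuming the per-step softmax outputs y_hat<t> are already available as arrays (the names and dummy values below are hypothetical):

```python
import numpy as np

def sequence_probability(softmax_outputs, target_idxs):
    # P(y<1>, ..., y<Ty>) = prod over t of P(y<t> | y<1>, ..., y<t-1>),
    # where softmax_outputs[t] is the model's distribution at time-step t.
    prob = 1.0
    for y_hat_t, idx in zip(softmax_outputs, target_idxs):
        prob *= y_hat_t[idx]
    return prob

# Example with a 3-step sequence over a 10-word vocabulary
softmax_outputs = [np.full(10, 0.1)] * 3   # dummy uniform distributions
print(sequence_probability(softmax_outputs, [4, 3, 9]))  # 0.1^3 ~ 0.001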

Loss function

(Figure: loss)
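For reference, the loss in the figure is the softmax cross-entropy at each time-step, summed over the whole sequence:

```latex
\mathcal{L}^{<t>}\!\left(\hat{y}^{<t>}, y^{<t>}\right) = -\sum_{i} y_i^{<t>} \log \hat{y}_i^{<t>}
\qquad
\mathcal{L} = \sum_{t} \mathcal{L}^{<t>}\!\left(\hat{y}^{<t>}, y^{<t>}\right)
```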

Sampling a sequence from a trained RNN language model

(Figure: sequence_model)

  • sample from the trained model’s distribution to generate novel sequences of words
  • randomly sample each word according to the softmax distribution at that time-step
  • keep sampling until you generate an EOS token (see the sketch after this list)
  • Could also be a character-level RNN instead of a word-level RNN
    • pros : you never have to worry about unknown word tokens
    • cons : you end up with much longer sequences, so character-level models are not as good as word-level language models at capturing long-range dependencies between the earlier and later parts of the sentence
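A minimal sketch of the word-level sampling loop, assuming a trained step function that returns the softmax distribution at each time-step (the function name and signature are hypothetical):

```python
import numpy as np

def sample_sequence(rnn_step, vocab, eos_idx, hidden_size=64, max_len=50):
    # rnn_step(a_prev, x) -> (y_hat, a) is assumed: one forward step of a
    # trained RNN language model returning a softmax distribution y_hat.
    a = np.zeros(hidden_size)        # a<0> = zero vector
    x = np.zeros(len(vocab))         # x<1> = zero vector
    words = []
    for _ in range(max_len):
        y_hat, a = rnn_step(a, x)                    # distribution over the next word
        idx = np.random.choice(len(vocab), p=y_hat)  # randomly sample from the softmax
        if idx == eos_idx:                           # stop once EOS is generated
            break
        words.append(vocab[idx])
        x = np.zeros(len(vocab))                     # feed the sampled word back in
        x[idx] = 1.0
    return " ".join(words)
```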

Vanishing gradients problem with RNNs

  • basic RNNs are not very good at capturing very long-term dependencies (similar to very deep NNs)

Gated Recurrent Unit (GRU)

RNN unit (Figure: rnn)

GRU unit (Figures: gru, gru2)

  • the element-wise multiplication by the gate tells your GRU which dimensions of the memory cell vector to update at every time-step, so you can choose to keep some bits constant while updating other bits (as sketched below)
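A tiny numpy illustration of this per-dimension gated update (the vectors are made-up examples):

```python
import numpy as np

c_prev  = np.array([0.9, -0.4, 0.2])    # memory cell c<t-1>
c_tilde = np.array([0.1,  0.8, -0.5])   # candidate value c~<t>
gamma_u = np.array([1.0,  0.0, 0.5])    # update gate, one value per dimension

# Element-wise: a gate value near 1 overwrites that dimension with the
# candidate, near 0 keeps the old memory, in between blends the two.
c_next = gamma_u * c_tilde + (1 - gamma_u) * c_prev
print(c_next)   # [ 0.1  -0.4  -0.15]: dim 0 replaced, dim 1 kept, dim 2 blended
```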

Full GRU (Figure: full_gru)

  • the relevance gate gamma_r tells you how relevant c<t-1> is to computing the next candidate c~<t>
  • GRUs maintain longer-range connections and help address the vanishing gradient problem
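Written out, the full GRU equations (in the course notation) are:

```latex
\tilde{c}^{<t>} = \tanh\left(W_c \left[\Gamma_r \odot c^{<t-1>},\, x^{<t>}\right] + b_c\right) \\
\Gamma_u = \sigma\left(W_u \left[c^{<t-1>},\, x^{<t>}\right] + b_u\right) \\
\Gamma_r = \sigma\left(W_r \left[c^{<t-1>},\, x^{<t>}\right] + b_r\right) \\
c^{<t>} = \Gamma_u \odot \tilde{c}^{<t>} + (1 - \Gamma_u) \odot c^{<t-1>} \\
a^{<t>} = c^{<t>}
```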

Long short-term memory unit (LSTM)

(Figures: lstm_fig, lstm_formula)

  • slightly more powerful and more general version of the GRU
  • separate forget and update gates (plus an output gate)
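For reference, the LSTM equations from the formula figure (course notation) are:

```latex
\tilde{c}^{<t>} = \tanh\left(W_c \left[a^{<t-1>},\, x^{<t>}\right] + b_c\right) \\
\Gamma_u = \sigma\left(W_u \left[a^{<t-1>},\, x^{<t>}\right] + b_u\right) \\
\Gamma_f = \sigma\left(W_f \left[a^{<t-1>},\, x^{<t>}\right] + b_f\right) \\
\Gamma_o = \sigma\left(W_o \left[a^{<t-1>},\, x^{<t>}\right] + b_o\right) \\
c^{<t>} = \Gamma_u \odot \tilde{c}^{<t>} + \Gamma_f \odot c^{<t-1>} \\
a^{<t>} = \Gamma_o \odot \tanh\left(c^{<t>}\right)
```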

Connect several LSTM units (Figure: lstm_connect)
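A short sketch of chaining LSTM units over a sequence, using PyTorch's built-in nn.LSTM rather than the course's from-scratch implementation (the tensor shapes are illustrative):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=64, batch_first=True)

x = torch.randn(1, 8, 10)          # (batch, time-steps Tx=8, input vector size)
outputs, (h_n, c_n) = lstm(x)      # each time-step's a<t>, c<t> feed the next unit
print(outputs.shape)               # torch.Size([1, 8, 64]): a<1> ... a<8>
print(h_n.shape, c_n.shape)        # final hidden state a<Tx> and cell state c<Tx>
```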

Bidirectional RNN

Problem of unidirectional RNN

  • EX1. He said, “Teddy bears are on sale!”
  • EX2. He said, “Teddy Roosevelt was a great President!”
  • it is not enough to look only at the first part of the sentence to figure out whether the third word “Teddy” is part of a person’s name

Solution = Bidirectional RNN (Figure: bidrectional_rnn)

  • ex. in the case of y_hat<3>, inputs x<1>, x<2>, x<3> come in through the forward recurrent components (blue) and x<4> comes in through the backward recurrent components (red)
  • a bidirectional RNN with LSTM units appears to be commonly used
  • cons : you need the entire sequence of data before you can make predictions anywhere
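A sketch of a bidirectional LSTM layer in PyTorch (sizes are illustrative); note that the whole sequence must be available up front, which is the downside mentioned above:

```python
import torch
import torch.nn as nn

birnn = nn.LSTM(input_size=10, hidden_size=64, batch_first=True, bidirectional=True)

x = torch.randn(1, 4, 10)      # e.g. x<1> ... x<4>
outputs, _ = birnn(x)
print(outputs.shape)           # torch.Size([1, 4, 128]): forward and backward
                               # activations are concatenated at every time-step,
                               # so y_hat<3> can use x<1..3> (forward) and x<4> (backward)
```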

Deep RNNs

(Figure: deep_rnn)

  • deep RNNs are quite computationally expensive to train
  • Because of the temporal dimension, these networks can already get quite big even with just a small handful of layers (3 recurrent layers is already quite big)
  • Another variant adds a stack of deep layers after the 3rd recurrent layer that are not connected horizontally (i.e. ordinary feed-forward layers applied at each time-step)
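A sketch of a 3-layer stacked recurrent network in PyTorch (sizes are illustrative):

```python
import torch
import torch.nn as nn

deep_rnn = nn.LSTM(input_size=10, hidden_size=64, num_layers=3, batch_first=True)

x = torch.randn(1, 8, 10)
outputs, (h_n, c_n) = deep_rnn(x)
print(outputs.shape)   # torch.Size([1, 8, 64]): activations of the top (3rd) layer
print(h_n.shape)       # torch.Size([3, 1, 64]): final hidden state of each layer

# The "deep layers not connected horizontally" variant would apply ordinary
# feed-forward layers (e.g. nn.Linear) on top of `outputs` at each time-step.
```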

[Source] https://www.coursera.org/learn/nlp-sequence-models
