[Google_Bootcamp_Day25]

Word representation

(figure: one-hot word vectors)

Problem of one-hot encoding

  • it doesn’t allow an algorithm to easily generalize across words
  • the inner product between any two different one-hot vectors is zero
  • Example:
    • sentence 1: I want a glass of orange ______ (blank: juice)
    • sentence 2: I want a glass of apple ______
    • with one-hot encoding there is no relationship between “orange” and “apple”, so it is hard to learn that the blank in sentence 2 is also “juice” (see the sketch below)
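
A minimal NumPy sketch of this problem, using a made-up 8-word toy vocabulary (the word list is illustrative, not from the course): the inner product between any two different one-hot vectors is always zero, so the representation carries no notion of similarity.

```python
import numpy as np

VOCAB = ["a", "apple", "glass", "i", "juice", "of", "orange", "want"]  # toy vocabulary

def one_hot(word, vocab=VOCAB):
    """Return the one-hot vector for `word` over the toy vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

orange, apple = one_hot("orange"), one_hot("apple")
print(np.dot(orange, apple))   # 0.0 -> no notion of similarity between different words
print(np.dot(orange, orange))  # 1.0 -> every word is only "similar" to itself
```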

Featurized representation

(figure: featurized word representation)

  • ex. represent each word as a 300-dimensional feature vector
  • the representations for “orange” and “apple” are now quite similar
  • this allows the algorithm to generalize better across different words (see the sketch below)
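
A toy sketch of a featurized representation. The 4-dimensional vectors below (roughly: gender, royal, age, fruit) are hand-written stand-ins for real learned 300-dimensional embeddings; the numbers are purely illustrative.

```python
import numpy as np

# Made-up feature vectors; real embeddings would be 300-dimensional and learned.
embedding = {
    "king":   np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen":  np.array([ 0.97, 0.95, 0.69, 0.01]),
    "orange": np.array([ 0.00, 0.01, 0.03, 0.95]),
    "apple":  np.array([ 0.01, 0.00, 0.02, 0.97]),
}

# Unlike one-hot vectors, distances now reflect meaning.
print(np.linalg.norm(embedding["orange"] - embedding["apple"]))  # small: similar words
print(np.linalg.norm(embedding["orange"] - embedding["king"]))   # large: unrelated words
```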

Visualizing word embeddings (ex. t-SNE)

  • t-SNE maps the 300-D embedding vectors down to 2-D, so similar words appear close together in the plot (see the sketch below)
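
A minimal sketch of the t-SNE step with scikit-learn. The embeddings here are random stand-ins rather than real trained vectors, and the tiny word list and small perplexity are chosen only to keep the example runnable.

```python
import numpy as np
from sklearn.manifold import TSNE

words = ["man", "woman", "king", "queen", "apple", "orange"]
E = np.random.rand(len(words), 300)          # stand-in for real 300-D embeddings

# Perplexity must be smaller than the number of points for such a tiny example.
tsne = TSNE(n_components=2, perplexity=3, random_state=0)
points_2d = tsne.fit_transform(E)            # shape (6, 2): one 2-D point per word

for w, (x, y) in zip(words, points_2d):
    print(f"{w}: ({x:.2f}, {y:.2f})")
```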

Using word embeddings

  1. Learn word embeddings from a large text corpus (1–100B words), or download a pre-trained embedding online
  2. Transfer the embedding to a new task with a smaller training set (ex. 100K words)
  3. Optional: continue to fine-tune the word embeddings with the new data (only when the new dataset is large enough); see the sketch below
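
A rough sketch of steps 1–2, assuming pre-trained vectors in a GloVe-style text file (e.g. "glove.6B.300d.txt", downloaded separately); the file name and the task vocabulary below are placeholders, not part of the course material.

```python
import numpy as np

EMBEDDING_FILE = "glove.6B.300d.txt"   # assumption: pre-trained file obtained elsewhere

def load_pretrained_embeddings(path):
    """Read a GloVe-style text file ("word v1 ... v300" per line) into a dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Step 2: reuse the vectors for the (smaller) task vocabulary.
# embeddings = load_pretrained_embeddings(EMBEDDING_FILE)
# task_vocab = ["i", "want", "a", "glass", "of", "orange", "juice"]   # hypothetical
# E = np.stack([embeddings.get(w, np.zeros(300, np.float32)) for w in task_vocab])
#
# Step 3 (optional): treat E as a trainable parameter of the downstream model
# only if the new dataset is large enough; otherwise keep it frozen.
```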

Relation to face encoding

(figure: face encoding network)

  • “encoding” and “embedding” have almost the same meaning
  • In the case of “face encoding”: train a neural network that can take any face picture as input, even a brand-new image, and compute an encoding for that picture
  • In the case of “word embedding”: there is a fixed vocabulary, and the model just learns a fixed embedding for each word in that vocabulary

Analogy reasoning

(figure: feature-vector plot of analogy relationships)

  • One of the remarkable results about word embeddings is the generality of the analogy relationships they can learn, e.g. man : woman :: king : queen (see the equations below)
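
Writing the analogy idea out as equations (this is the standard formulation from the lecture; the notation e_w denotes the embedding vector of word w):

```latex
% "man is to woman as king is to ___?" in embedding space:
\[
  e_{\text{man}} - e_{\text{woman}} \;\approx\; e_{\text{king}} - e_{\text{queen}}
\]
% so the blank is filled by the word whose embedding maximizes the similarity
\[
  w^{*} = \arg\max_{w}\ \mathrm{sim}\!\left(e_{w},\; e_{\text{king}} - e_{\text{man}} + e_{\text{woman}}\right)
\]
```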

Cosine similarity

(figure: cosine function)

  • if you learn a set of word embeddings and search for the word w that maximizes this type of similarity, you can actually get exactly the right answer (see the sketch below)
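
A small Python sketch of that search, using cosine similarity sim(u, v) = uᵀv / (‖u‖₂ ‖v‖₂). The embedding values below are made up for illustration, not real learned vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """sim(u, v) = u.v / (||u|| * ||v||)"""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings (illustrative values only).
emb = {
    "man":    np.array([-1.00, 0.01, 0.03]),
    "woman":  np.array([ 1.00, 0.02, 0.02]),
    "king":   np.array([-0.95, 0.93, 0.70]),
    "queen":  np.array([ 0.97, 0.95, 0.69]),
    "orange": np.array([ 0.00, 0.01, 0.95]),
}

# "man is to woman as king is to ?": maximize sim(e_w, e_king - e_man + e_woman).
target = emb["king"] - emb["man"] + emb["woman"]
candidates = [w for w in emb if w not in ("man", "woman", "king")]
best = max(candidates, key=lambda w: cosine_similarity(emb[w], target))
print(best)  # "queen" with these toy vectors
```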

How to learn word embeddings (embedding matrix)

(figure: embedding matrix E)

  • Assume the vocabulary size is 10,000 (10K) and the word “orange” is at index 6257
  • Initialize E randomly and learn all the parameters of this 300 × 10,000 matrix; then E times the one-hot vector o_6257 gives the embedding vector e_6257
  • In practice it is not efficient to implement this as a full matrix–vector multiplication, because the one-hot vector is very high-dimensional and almost all of its elements are zero
  • Instead, use a specialized lookup function that just reads the corresponding column of the matrix E rather than doing the multiplication (see the sketch below)
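
A quick NumPy sketch contrasting the full matrix–vector product with a direct column lookup. The index 6257 for “orange” follows the lecture; the matrix here is random rather than learned.

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300
rng = np.random.default_rng(0)
E = rng.standard_normal((emb_dim, vocab_size))  # embedding matrix (learned in practice)

# One-hot vector for "orange" (index 6257 as in the lecture).
o_6257 = np.zeros(vocab_size)
o_6257[6257] = 1.0

e_slow = E @ o_6257   # full matrix-vector product: wasteful, mostly multiplies by zero
e_fast = E[:, 6257]   # direct column lookup: what embedding layers actually do

print(np.allclose(e_slow, e_fast))  # True -> both give the same 300-D embedding vector
```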

[Source] https://www.coursera.org/learn/nlp-sequence-models
