[Google_Bootcamp_Day25]
Word representation
Problems with one-hot encoding
- it doesn’t allow an algorithm to easily generalize across words
- the inner product between any two different one-hot vectors is zero
- Example:
- sentence 1: I want a glass of orange ______ (blank : juice)
- sentence 2: I want a glass of apple ______
- With one-hot encoding there is no notion of similarity between “orange” and “apple”, so it is hard to infer that the blank in sentence 2 is also “juice” (see the sketch below)
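A minimal numpy sketch of this point, assuming a 10K vocabulary and made-up word indices: the dot product of any two distinct one-hot vectors is zero, so one-hot encoding carries no notion of word similarity.

```python
import numpy as np

vocab_size = 10_000

def one_hot(index, size=vocab_size):
    """Return a one-hot vector of length `size`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Illustrative indices, not real vocabulary positions.
orange = one_hot(6257)
apple = one_hot(456)
juice = one_hot(4834)

# Any two distinct one-hot vectors are orthogonal, so "orange" looks no more
# related to "apple" than to any other word.
print(orange @ apple)   # 0.0
print(orange @ juice)   # 0.0
```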
Featurized representation
- ex. represent words as 300-dimensional vector
- representations for “orange” and “apple” are now quite similar
- allows the algorithm to generalize better across different words (see the sketch below)
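A toy sketch of a featurized representation, using made-up 4-dimensional feature values (gender, royal, age, food) in place of the learned 300-dimensional vectors; the numbers are illustrative only.

```python
import numpy as np

# Hand-picked toy feature values: [gender, royal, age, food].
emb = {
    "king":   np.array([-0.95, 0.93, 0.70, 0.02]),
    "apple":  np.array([ 0.00, -0.01, 0.03, 0.95]),
    "orange": np.array([ 0.01, 0.00, -0.02, 0.97]),
}

# "apple" and "orange" now have nearly identical vectors,
# while both are far from "king".
print(np.linalg.norm(emb["apple"] - emb["orange"]))  # small distance
print(np.linalg.norm(emb["apple"] - emb["king"]))    # much larger distance
```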
Visualizing word embeddings (ex. t-SNE)
300D -> 2D
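A sketch of the 300D -> 2D mapping using scikit-learn's TSNE; the embedding matrix here is random, standing in for real learned embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for a learned embedding matrix: 50 words x 300 dimensions.
embeddings = np.random.randn(50, 300)

# Non-linear 300D -> 2D mapping; perplexity must stay below the number of points.
coords_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)
print(coords_2d.shape)  # (50, 2)
```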
Using word embeddings
- Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings online
- Transfer the embeddings to a new task with a smaller training set (ex. 100K words)
- Optional: continue to fine-tune the word embeddings with the new data, but only when the new dataset is big enough (see the sketch below)
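One common way to wire up this transfer step, sketched with tf.keras; the vocabulary size, embedding dimension, and the random `pretrained` matrix are placeholders for a real downloaded embedding, not values from the course.

```python
import numpy as np
import tensorflow as tf

vocab_size, emb_dim = 10_000, 300                                      # assumed sizes
pretrained = np.random.randn(vocab_size, emb_dim).astype("float32")   # stand-in for downloaded vectors

# Step 2: transfer - initialize the layer from the pre-trained matrix and freeze it.
# Step 3 (optional): set trainable=True to fine-tune when the new dataset is large enough.
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=emb_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=False,
)
```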
Relation to face encoding
- “Encoding” and “embedding” have almost the same meaning
- In the case of “face encoding”: train a neural network that can take any face picture as input, even a brand-new image, and compute an encoding for that picture
- In the case of “word embedding”: there is a fixed vocabulary, and the network just learns a fixed embedding for each word in that vocabulary
Analogy reasoning
- one of the remarkable results about word embeddings is the generality of analogy relationships they can learn (ex. man is to woman as king is to queen)
Cosine similarity
- sim(u, v) = (u · v) / (||u|| ||v||)
- if you learn a set of word embeddings and find the word w that maximizes sim(e_w, e_king - e_man + e_woman), you can actually get the exact right answer, “queen” (see the sketch below)
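A self-contained sketch of analogy reasoning with cosine similarity, reusing the same kind of toy 4-D feature vectors as the featurized-representation example above (values are illustrative, not learned embeddings).

```python
import numpy as np

def cosine_similarity(u, v):
    """sim(u, v) = (u . v) / (||u|| * ||v||)"""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-D vectors: [gender, royal, age, food].
emb = {
    "man":    np.array([-1.00, 0.01, 0.03, 0.09]),
    "woman":  np.array([ 1.00, 0.02, 0.02, 0.01]),
    "king":   np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen":  np.array([ 0.97, 0.95, 0.69, 0.01]),
    "orange": np.array([ 0.01, 0.00, -0.02, 0.97]),
}

# "man is to woman as king is to ?":
# find the word w maximizing sim(e_w, e_king - e_man + e_woman).
target = emb["king"] - emb["man"] + emb["woman"]
candidates = [w for w in emb if w not in {"man", "woman", "king"}]
best = max(candidates, key=lambda w: cosine_similarity(emb[w], target))
print(best)  # "queen"
```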
How to learn word embedding (Embedding matrix)
- Assume the vocabulary size is 10,000 (10K) and that the word “orange” is the 6257th word
- Initialize E randomly and learn all the parameters of this 300 by 10,000 matrix; then E times the one-hot vector for “orange” gives you its embedding vector
- In practice it is not efficient to implement this as a full matrix-vector multiplication, because the one-hot vector is very high-dimensional and almost all of its elements are zero
- Instead, use a specialized function that just looks up the corresponding column of the matrix E rather than doing the matrix multiplication (see the sketch below)
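A small numpy sketch of this efficiency point, assuming a randomly initialized E and the “orange” index 6257: the matrix-vector product and the direct column lookup give the same embedding, but the lookup avoids all the multiply-by-zero work.

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300
E = np.random.randn(emb_dim, vocab_size)   # 300 x 10,000 embedding matrix (learned in practice)

o_6257 = np.zeros(vocab_size)
o_6257[6257] = 1.0                          # one-hot vector for "orange"

e_matmul = E @ o_6257                       # full matrix-vector multiplication: wasteful
e_lookup = E[:, 6257]                       # direct column lookup: what frameworks actually do

print(np.allclose(e_matmul, e_lookup))      # True
```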
[Source] https://www.coursera.org/learn/nlp-sequence-models