Word Representation - Word2Vec
Credits
- Original authors: Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean.
- Code Website: Word2Vec
- Explained Paper: Neural Word Embedding as Implicit Matrix Factorization. Omer Levy, Yoav Goldberg
- Explained Paper Short Version: Word2vec Explained. Yoav Goldberg, Omer Levy
- Easier-to-understand code: Keras Word2vec. Francois Chollet.
- TensorFlow code: TensorFlow word2vec
If your time is in short supply, just read the Explained Paper Short Version.
Goals
The English language has on the order of 100,000 words. When working on an NLP problem, one can represent each word as a one-hot vector of dimension 100,000. This is a sparse, high-dimensional input. Our goal is to map it to a dense, low-dimensional representation of around 300 dimensions, \( v _w\). We can then feed this representation to some other model, like an LSTM, for NLP tasks. Hopefully, our new representation respects some semantic rules, such as
- Words with similar meaning should be close to each other
- The direction from “do” to “doing” should be similar to the direction from “go” to “going”
- Doing addition: King - male + female ~ queen
It is surprising that word2vec can give us this without any labels on the meaning of words. All it needs is word co-occurrence. It is also surprisingly simple, nothing more complicated than logistic regression.
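To make these semantic rules concrete, here is a minimal sketch of how one would test the analogy rule with cosine similarity. The embedding matrix and word-to-index mapping below are placeholders, not trained vectors:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: 1 means same direction, 0 means orthogonal.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Stand-ins for a trained embedding matrix of shape (vocab_size, 300)
# and a dict mapping words to row indices.
vocab = {"king": 0, "male": 1, "female": 2, "queen": 3}
embedding = np.random.randn(len(vocab), 300)  # placeholder for trained vectors

v = lambda w: embedding[vocab[w]]

# "Doing addition": king - male + female should land near queen
# (with trained vectors, this similarity would be high).
print(cosine(v("king") - v("male") + v("female"), v("queen")))
```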
Ideas of Word2vec
Given a sentence:
You used to call me on my cell phone.
and a window size \( k \) (consider \( k = 1 \) for simplicity), we define the context of a word as its \( 2k \) neighboring words. Word2vec builds the positive dataset \( \mathcal{D} \) of [word, context] pairs:
word | context |
---|---|
you | used |
used | you |
used | to |
to | used |
to | call |
Each of these pairs did appear in the dataset, so we associate it with the label 1. We now define the negative dataset \( \mathcal{D'} \) of word pairs
word | non_context |
---|---|
you | random_word1 |
used | random_word2 |
to | random_word3 |
call | random_word4 |
and label these pairs as 0. (Note that when we pick a random word from the vocabulary, there is some tiny chance that the picked word is actually a valid context, but that chance is small enough that we ignore it.)
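Here is a minimal sketch of building \( \mathcal{D} \) and \( \mathcal{D'} \) this way, with \( k = 1 \) and one random negative per positive pair; the function name and toy sentence are just for illustration:

```python
import random

def make_pairs(tokens, vocab, k=1, negatives_per_pair=1):
    """Build positive (word, context, 1) and negative (word, random_word, 0) examples."""
    examples = []
    for i, word in enumerate(tokens):
        # Positive pairs: the 2k neighbors of position i.
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j == i:
                continue
            examples.append((word, tokens[j], 1))
            # Negative pairs: random words drawn from the vocabulary.
            for _ in range(negatives_per_pair):
                examples.append((word, random.choice(vocab), 0))
    return examples

sentence = "you used to call me on my cell phone".split()
vocabulary = sorted(set(sentence))  # toy vocabulary; a real one has ~100,000 words
for example in make_pairs(sentence, vocabulary)[:6]:
    print(example)
```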
We then use the logistic loss to train a vector representation for each word. We maximize the probability of seeing the word pairs in \( \mathcal{D}\) and of not seeing the word pairs in \( \mathcal{D'}\):

\[ \arg\max_{v _w, v _c} \prod_{(w, c) \in \mathcal{D}} P(D = 1 \mid w, c) \prod_{(w, c) \in \mathcal{D'}} P(D = 0 \mid w, c) \]

Modeling \( P(D = 1 \mid w, c) = \sigma(v _c \cdot v _w) \) with \( \sigma(x) = \frac{1}{1 + e ^ {-x}} \) and taking the log, we can rewrite the optimization as

\[ \arg\max_{v _w, v _c} \sum_{(w, c) \in \mathcal{D}} \log \sigma(v _c \cdot v _w) + \sum_{(w, c) \in \mathcal{D'}} \log \sigma(- v _c \cdot v _w) \]
Now, to be clear, all of the \( v _w\) are our vector representations of words; together, they form a matrix of size (vocabulary size by projection size), e.g. (100,000 by 300). All of the \( v _c \) are our vector representations of contexts; together, they form a matrix of the same size. In practice this latter matrix is often discarded after training. We optimize with respect to these two matrices, so that the product \( v _c \cdot v _w\) is large for the [word, context] pairs we see in our dataset, and small for the [word, context] pairs we do not see.
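A small sketch of the objective above, written out in NumPy; the toy sizes and variable names `V_w`, `V_c` are my own, and real models use a ~100,000-word vocabulary with ~300 dimensions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(V_w, V_c, positive_pairs, negative_pairs):
    """Log-likelihood we maximize: log sigma(v_c . v_w) over D,
    plus log sigma(-v_c . v_w) over D'."""
    total = 0.0
    for w, c in positive_pairs:
        total += np.log(sigmoid(V_c[c] @ V_w[w]))
    for w, c in negative_pairs:
        total += np.log(sigmoid(-V_c[c] @ V_w[w]))
    return total

# Toy setup: 5 words, 3-dimensional vectors.
rng = np.random.default_rng(0)
V_w = rng.normal(size=(5, 3))  # word matrix, shape (vocab_size, dim)
V_c = rng.normal(size=(5, 3))  # context matrix, same shape, usually discarded after training
print(sgns_objective(V_w, V_c, positive_pairs=[(0, 1), (1, 0)], negative_pairs=[(0, 3)]))
```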
Enhancement
The final method used is a bit more involved, with tricks that make it work better, for example:
- Instead of sampling one negative context per [word, context] pair uniformly, it samples \( m \) context words with probability proportional to how often each context word appears in the dataset (see the sketch after this list).
- Instead of using a fixed window \( k \) around each word, the window size is sampled uniformly from \( 1, 2, \ldots, K \).
- Treat all rare words as a single “UNK” token, and downsample the most common words.
- The method we have described so far is called Skip-Gram with Negative Sampling; the original paper also describes Continuous Bag of Words (CBOW), which models the probability of a word given its context. The authors claim that skip-gram with negative sampling works better for bigger datasets (see more in the TensorFlow example).
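As a rough sketch of the first two tricks, here is how one might draw negative samples from a frequency-based distribution and draw a dynamic window size. Note that the released word2vec code actually raises the counts to the 3/4 power before normalizing, which the `power` parameter below exposes; the toy counts are made up:

```python
import numpy as np

def negative_sampling_distribution(counts, power=0.75):
    # word2vec's released code raises counts to the 3/4 power before normalizing;
    # power=1.0 would give sampling proportional to raw frequency.
    probs = np.array(counts, dtype=float) ** power
    return probs / probs.sum()

def sample_window(K, rng):
    # Dynamic window: an effective window size drawn uniformly from 1..K,
    # which on average weights nearby context words more heavily.
    return rng.integers(1, K + 1)

rng = np.random.default_rng(0)
counts = [500, 100, 20, 5]  # toy corpus frequencies for a 4-word vocabulary
probs = negative_sampling_distribution(counts)
negatives = rng.choice(len(counts), size=10, p=probs)
print(probs, negatives, sample_window(K=5, rng=rng))
```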
Extra
This word representation, finding the two vector (matrix) representations \( V _w\) and \( V _c\), can be thought of as factorizing an implicit matrix \( M = V _w \cdot V _c ^ T \), where each element \( M _{ij}\) reflects the strength of association between word \( i \) and context \( j\). More specifically, it is found that

\[ M _{ij} = \mathrm{PMI}(w _i, c _j) - \log m \]

where PMI is the pointwise mutual information, defined as

\[ \mathrm{PMI}(w, c) = \log \frac{N(w, c) \cdot |\mathcal{D}|}{N(w) \cdot N(c)} \]

where \( N(w)\) counts the number of occurrences of \( w \) in \( \mathcal{D}\), and \( m \) is the number of negative samples per positive pair.
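A small sketch of computing this implicit matrix from a table of co-occurrence counts, following the formula above; the variable names are mine and the toy counts are made up:

```python
import numpy as np

def shifted_pmi_matrix(cooc, m=1):
    """cooc[i, j] = N(w_i, c_j), the number of times word i occurred with context j.
    Returns M with M[i, j] = PMI(w_i, c_j) - log(m)."""
    total = cooc.sum()                     # |D|, total number of pairs
    N_w = cooc.sum(axis=1, keepdims=True)  # N(w), per-word counts
    N_c = cooc.sum(axis=0, keepdims=True)  # N(c), per-context counts
    with np.errstate(divide="ignore"):     # zero counts give -inf PMI
        pmi = np.log(cooc * total / (N_w * N_c))
    return pmi - np.log(m)

cooc = np.array([[4.0, 1.0],
                 [1.0, 2.0]])
print(shifted_pmi_matrix(cooc, m=1))
```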
Testing out
The official website of word2vec has very fast code in C, where one can test things out pretty easily. Head over there; training the model should take about five minutes. It even has the option of downloading pre-trained models. Here I’m using their pre-trained model on Google News words to find words similar to a query word (think of this as a thesaurus):
Word: eminem. Position in vocabulary: 566691

Word | Cosine distance |
---|---|
jay z | 0.732697 |
rick_ross | 0.731509 |
kanye | 0.715282 |
shyt | 0.705407 |
chris brown | 0.697447 |
i luv | 0.694622 |
lady gaga | 0.690142 |
john mayer | 0.686606 |
ozzy | 0.678592 |
soulja boy | 0.671136 |
rihanna | 0.670854 |
juss | 0.670568 |
lil wayne | 0.669848 |
beyonce | 0.667990 |
cuz u | 0.664925 |
mariah | 0.664813 |
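If you prefer Python, the same pre-trained vectors can be loaded with gensim and queried in a couple of lines; the file name below is just whatever you saved the Google News download as:

```python
from gensim.models import KeyedVectors

# Path to the downloaded Google News vectors; adjust to your local copy.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
print(vectors.most_similar("eminem", topn=10))
```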