Word Representation - Word2Vec
Credits
- Original authors: Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean.
- Code Website: Word2Vec
- Explained Paper: Neural Word Embedding as Implicit Matrix Factorization. Omer Levy, Yoav Goldberg
- Explained Paper Short Version: Word2vec Explained. Yoav Goldberg, Omer Levy
- Easier-to-understand code: Keras Word2vec. Francois Chollet.
- TensorFlow code: TensorFlow word2vec
If your time is in short supply, just read the Explained Paper Short Version.
Goals
The English language has on the order of 100,000 words. When working on an NLP problem, one can represent each word as a one-hot vector of dimension 100,000. This is a sparse, high-dimensional input. Our goal is to map it to a dense, low-dimensional representation of around 300 dimensions, \( v _w\). We can then feed this representation to some other model, like an LSTM, for NLP tasks. Hopefully, our new representation respects some semantic rules, such as
- Words with similar meaning should be close to each other
- The direction from “do” to “doing” should be similar to the direction from “go” to “going”
- Doing addition: King - male + female ~ queen
It is surprising that word2vec can give us this without any labels on the meaning of words. All it needs is word co-occurrence. It is also surprisingly simple, nothing more complicated than logistic regression.
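To make these semantic rules concrete, here is a minimal sketch of how one would test the analogy rule with cosine similarity. The embedding matrix and word-to-index mapping below are placeholders, not trained vectors:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: 1 means same direction, 0 means orthogonal.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Stand-ins for a trained embedding matrix of shape (vocab_size, 300)
# and a dict mapping words to row indices.
vocab = {"king": 0, "male": 1, "female": 2, "queen": 3}
embedding = np.random.randn(len(vocab), 300)  # placeholder for trained vectors

v = lambda w: embedding[vocab[w]]

# "Doing addition": king - male + female should land near queen
# (with trained vectors, this similarity would be high).
print(cosine(v("king") - v("male") + v("female"), v("queen")))
```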
Ideas of Word2vec
Given a sentence:
You used to call me on my cell phone.
and a window size \( k \) (consider \( k = 1 \) for simplicity), we define the context of a word as its \( 2k \) neighboring words. Word2vec builds the positive dataset \( \mathcal{D} \) of [word, context] pairs:
word | context |
---|---|
you | used |
used | you |
used | to |
to | used |
to | call |
Each of these pairs did appear in the dataset, so we associate it with the label 1. We now define the negative dataset \( \mathcal{D'} \) of word pairs
word | non_context |
---|---|
you | random_word1 |
used | random_word2 |
to | random_word3 |
call | random_word4 |
and label these pairs as 0. (Note that when we pick a random word from the vocabulary, there is some tiny chance that the picked word is actually a valid context, but that chance is small enough that we ignore it.)
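Here is a minimal sketch of building \( \mathcal{D} \) and \( \mathcal{D'} \) this way, with \( k = 1 \) and one random negative per positive pair; the function name and toy sentence are just for illustration:

```python
import random

def make_pairs(tokens, vocab, k=1, negatives_per_pair=1):
    """Build positive (word, context, 1) and negative (word, random_word, 0) examples."""
    examples = []
    for i, word in enumerate(tokens):
        # Positive pairs: the 2k neighbors of position i.
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j == i:
                continue
            examples.append((word, tokens[j], 1))
            # Negative pairs: random words drawn from the vocabulary.
            for _ in range(negatives_per_pair):
                examples.append((word, random.choice(vocab), 0))
    return examples

sentence = "you used to call me on my cell phone".split()
vocabulary = sorted(set(sentence))  # toy vocabulary; a real one has ~100,000 words
for example in make_pairs(sentence, vocabulary)[:6]:
    print(example)
```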
We then use the logistic loss to train a vector representation for each word. We maximize the probability of seeing the word pairs in \( \mathcal{D}\) and of not seeing the word pairs in \( \mathcal{D'}\):

\[ \arg\max_{v _w, v _c} \prod_{(w, c) \in \mathcal{D}} P(D = 1 \mid w, c) \prod_{(w, c) \in \mathcal{D'}} P(D = 0 \mid w, c) \]

Modeling \( P(D = 1 \mid w, c) = \sigma(v _c \cdot v _w) \) with \( \sigma(x) = \frac{1}{1 + e ^ {-x}} \) and taking the log, we can rewrite the optimization as

\[ \arg\max_{v _w, v _c} \sum_{(w, c) \in \mathcal{D}} \log \sigma(v _c \cdot v _w) + \sum_{(w, c) \in \mathcal{D'}} \log \sigma(- v _c \cdot v _w) \]
Now, to be clear, all of the \( v _w\) are our vector representations of words; together, they form a matrix of size (vocabulary size by projection size), e.g. (100,000 by 300). All of the \( v _c \) are our vector representations of contexts; together, they form a matrix of the same size. In practice this latter matrix is often discarded after training. We optimize with respect to these two matrices, so that the product \( v _c \cdot v _w\) is large for the [word, context] pairs we see in our dataset, and small for the [word, context] pairs we do not see.
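A small sketch of the objective above, written out in NumPy; the toy sizes and variable names `V_w`, `V_c` are my own, and real models use a ~100,000-word vocabulary with ~300 dimensions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(V_w, V_c, positive_pairs, negative_pairs):
    """Log-likelihood we maximize: log sigma(v_c . v_w) over D,
    plus log sigma(-v_c . v_w) over D'."""
    total = 0.0
    for w, c in positive_pairs:
        total += np.log(sigmoid(V_c[c] @ V_w[w]))
    for w, c in negative_pairs:
        total += np.log(sigmoid(-V_c[c] @ V_w[w]))
    return total

# Toy setup: 5 words, 3-dimensional vectors.
rng = np.random.default_rng(0)
V_w = rng.normal(size=(5, 3))  # word matrix, shape (vocab_size, dim)
V_c = rng.normal(size=(5, 3))  # context matrix, same shape, usually discarded after training
print(sgns_objective(V_w, V_c, positive_pairs=[(0, 1), (1, 0)], negative_pairs=[(0, 3)]))
```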
Enhancement
The final method used is a bit more involved, with tricks that make it work better, for example:
- Instead of sampling one negative context per [word, context] pair uniformly, it samples \( m \) context words with probability proportional to how often each context word appears in the dataset (see the sketch after this list).
- Instead of using a fixed window \( k \) around each word, the window size is sampled uniformly from \( 1, 2, \ldots, K \).
- Treat all rare words as a single “UNK” token, and downsample the most common words.
- The method we have described so far is called Skip-Gram with Negative Sampling; the original paper also describes Continuous Bag of Words (CBOW), which models the probability of a word given its context. The authors claim that skip-gram with negative sampling works better for bigger datasets (see more in the TensorFlow example).
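As a rough sketch of the first two tricks, here is how one might draw negative samples from a frequency-based distribution and draw a dynamic window size. Note that the released word2vec code actually raises the counts to the 3/4 power before normalizing, which the `power` parameter below exposes; the toy counts are made up:

```python
import numpy as np

def negative_sampling_distribution(counts, power=0.75):
    # word2vec's released code raises counts to the 3/4 power before normalizing;
    # power=1.0 would give sampling proportional to raw frequency.
    probs = np.array(counts, dtype=float) ** power
    return probs / probs.sum()

def sample_window(K, rng):
    # Dynamic window: an effective window size drawn uniformly from 1..K,
    # which on average weights nearby context words more heavily.
    return rng.integers(1, K + 1)

rng = np.random.default_rng(0)
counts = [500, 100, 20, 5]  # toy corpus frequencies for a 4-word vocabulary
probs = negative_sampling_distribution(counts)
negatives = rng.choice(len(counts), size=10, p=probs)
print(probs, negatives, sample_window(K=5, rng=rng))
```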
Extra
This word representation, finding the two vector (matrix) representations \( V _w\) and \( V _c\), can be thought of as factorizing an implicit matrix \( M = V _w \cdot V _c ^ T \), where each element \( M _{ij}\) reflects the strength of association between word \( i \) and context \( j\). More specifically, it is found that

\[ M _{ij} = \mathrm{PMI}(w _i, c _j) - \log m \]

where PMI is the pointwise mutual information, defined as

\[ \mathrm{PMI}(w, c) = \log \frac{N(w, c) \cdot |\mathcal{D}|}{N(w) \cdot N(c)} \]

where \( N(w)\) counts the number of occurrences of \( w \) in \( \mathcal{D}\), and \( m \) is the number of negative samples per positive pair.
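A small sketch of computing this implicit matrix from a table of co-occurrence counts, following the formula above; the variable names are mine and the toy counts are made up:

```python
import numpy as np

def shifted_pmi_matrix(cooc, m=1):
    """cooc[i, j] = N(w_i, c_j), the number of times word i occurred with context j.
    Returns M with M[i, j] = PMI(w_i, c_j) - log(m)."""
    total = cooc.sum()                     # |D|, total number of pairs
    N_w = cooc.sum(axis=1, keepdims=True)  # N(w), per-word counts
    N_c = cooc.sum(axis=0, keepdims=True)  # N(c), per-context counts
    with np.errstate(divide="ignore"):     # zero counts give -inf PMI
        pmi = np.log(cooc * total / (N_w * N_c))
    return pmi - np.log(m)

cooc = np.array([[4.0, 1.0],
                 [1.0, 2.0]])
print(shifted_pmi_matrix(cooc, m=1))
```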
Testing out
The official website of word2vec has very fast code in C, where one can test things out pretty easily. Head over there; training the model should take about five minutes. It even has the option of downloading pre-trained models. Here I’m using their pre-trained model on Google News words to find words similar to a query word (think of this as a thesaurus):
Word: eminem. Position in vocabulary: 566691

Word | Cosine distance |
---|---|
jay z | 0.732697 |
rick_ross | 0.731509 |
kanye | 0.715282 |
shyt | 0.705407 |
chris brown | 0.697447 |
i luv | 0.694622 |
lady gaga | 0.690142 |
john mayer | 0.686606 |
ozzy | 0.678592 |
soulja boy | 0.671136 |
rihanna | 0.670854 |
juss | 0.670568 |
lil wayne | 0.669848 |
beyonce | 0.667990 |
cuz u | 0.664925 |
mariah | 0.664813 |
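If you prefer Python, the same pre-trained vectors can be loaded with gensim and queried in a couple of lines; the file name below is just whatever you saved the Google News download as:

```python
from gensim.models import KeyedVectors

# Path to the downloaded Google News vectors; adjust to your local copy.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
print(vectors.most_similar("eminem", topn=10))
```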