a bit on word embeddings

Lately I’ve been working almost exclusively on continuous word representations, with the goal of finding vectorial representations of words which expose semantic and/or syntactic relationships between words.

As is typical for any interesting machine learning problem, there are a glut of clever models based on various assumptions (sparsity, hierarchical sparsity, low-rankedness, etc.) that yield respectable embeddings. Arguably, however, the most well known of these representations are the word2vec models due to Mikolov et al., which are part of a larger class of neural network-based methods for learning word representations. This is largely due to the facts that the authors released an efficient implementation of their algorithm, and demonstrated that the embeddings they calculate exhibit nice algebraic properties (e.g., ‘king’ – ‘man’ + ‘woman’ is approximately ‘queen’).

I’m partial to the word2vec (skip-gram) model precisely because it is based on a simple and sensible (although contested) linguistic axiom known as the distribution hypothesis— that words that appear in contexts with similar words tend to have similar meanings— and in its purest form has a simple probabilistic interpretation that makes its application amenable to nontrivial theoretical investigation (for instance, it’s now clear to me from whence the mysterious algebraic properties of the fitted word embeddings come). This model involves parameterizing the probability distribution of the words in the neighborhood of each word in the model. Semantically similar words end up having similar vectorial embeddings, precisely because the distribution of words in their neighborhoods are similar.

Unfortunately, the pure skip-gram model is not easily fit, because the calculation of the partition function involves a sum over all the words in the vocabulary, which can easily be on the order of millions. The authors of the word2vec paper mitigate this situation by introducing two approximations to this exact model: one is a variational approximation that finds the best approximation with a tree-structured class of probability distributions that are more computationally tractable, and one is an approximation inspired by the principle of noise contrastive estimation— that a good model only has to be able to distinguish between true observations and observations drawn from a sufficiently plausible noise distribution. The latter model is very similar in spirit to the model of Minh et al.

I’m curious as to what extent these approximate models lose semantic information that would be captured in the pure model, if it could be fit. It is entirely possible that the ‘approximations’ are themselves more useful models than the intractable pure model. One of the projects that I’m working on with my intern (Jiyan Yang) is attempting to experimentally quantify this, using a combination of smaller datasets and various methods of formulating the optimization problem (e.g., stochastic methods and score matching-type ideas) to make it feasible to fit the exact model.

We have more ambition than to understand the word2vec model. It seems that a large class of word embeddings can be considered to be attempting to fit probability distributions over the neighbors of a word, just as word2vec does, so I’m trying to figure out a universal framework for probabilistic word embeddings. It turns out that this has strong connections to several other applications in machine learning, so the understanding garnered here can be used in, e.g., collaborative filtering.