## a bit on word embeddings

### July 25, 2014

Lately I’ve been working almost exclusively on continuous word representations, with the goal of finding vectorial representations of words which expose semantic and/or syntactic relationships between words.

As is typical for any interesting machine learning problem, there are a glut of clever models based on various assumptions (sparsity, hierarchical sparsity, low-rankedness, etc.) that yield respectable embeddings. Arguably, however, the most well known of these representations are the word2vec models due to Mikolov et al., which are part of a larger class of neural network-based methods for learning word representations. This is largely due to the facts that the authors released an efficient implementation of their algorithm, and demonstrated that the embeddings they calculate exhibit nice algebraic properties (e.g., ‘king’ – ‘man’ + ‘woman’ is approximately ‘queen’).

I’m partial to the word2vec (skip-gram) model precisely because it is based on a simple and sensible (although contested) linguistic axiom known as the distribution hypothesis— that words that appear in contexts with similar words tend to have similar meanings— *and* in its purest form has a simple probabilistic interpretation that makes its application amenable to nontrivial theoretical investigation (for instance, it’s now clear to me from whence the mysterious algebraic properties of the fitted word embeddings come). This model involves parameterizing the probability distribution of the words in the neighborhood of each word in the model. Semantically similar words end up having similar vectorial embeddings, precisely because the distribution of words in their neighborhoods are similar.

Unfortunately, the pure skip-gram model is not easily fit, because the calculation of the partition function involves a sum over all the words in the vocabulary, which can easily be on the order of millions. The authors of the word2vec paper mitigate this situation by introducing two approximations to this exact model: one is a variational approximation that finds the best approximation with a tree-structured class of probability distributions that are more computationally tractable, and one is an approximation inspired by the principle of noise contrastive estimation— that a good model only has to be able to distinguish between true observations and observations drawn from a sufficiently plausible noise distribution. The latter model is very similar in spirit to the model of Minh et al.

I’m curious as to what extent these approximate models lose semantic information that would be captured in the pure model, if it could be fit. It is entirely possible that the ‘approximations’ are themselves more useful models than the intractable pure model. One of the projects that I’m working on with my intern (Jiyan Yang) is attempting to experimentally quantify this, using a combination of smaller datasets and various methods of formulating the optimization problem (e.g., stochastic methods and score matching-type ideas) to make it feasible to fit the exact model.

We have more ambition than to understand the word2vec model. It seems that a large class of word embeddings can be considered to be attempting to fit probability distributions over the neighbors of a word, just as word2vec does, so I’m trying to figure out a universal framework for probabilistic word embeddings. It turns out that this has strong connections to several other applications in machine learning, so the understanding garnered here can be used in, e.g., collaborative filtering.

I’m trying to use LDA on a large amount of data. A quick recap:

- Tried vowpal wabbit … it’s fast, I’ll give it that, but it’s also useless: the output is dubious (what I think are the topics look like they haven’t changed very much from the prior) *and* I have no idea how it maps onto topics and documents (the documentation is AWFUL, and the dimensions of the output files are WONKY).
- Tried two implementations of SCVB0, a stochastic collapsed variational bayes LDA algorithm: one doesn’t work at all (as in, it stalls on any amount of data — so crap), and the other works fine and fast on medium amounts of data but apparently requires more than 5 days on my size dataset (I quit after 5 days; it’s worth noting that this code was removed from github literally the next day after I downloaded it — so dubious).
- For whatever reason, Mallet can’t even import (using import-file) a relatively small amount of data: the import process takes forever and then quits with an error — so crap.
- Now I’m trying to use Mr.LDA, a map-reduce implementation. Given my experience so far, I am not holding out much hope, but when one must… The issue with Mr.LDA is that when I’m importing my data on the eBay Hadoop cluster, the mappers run out of heap memory during the tokenization process.

After spending of most of today trying to figure out how to increase the heap memory, I decided to just install Hadoop locally, where I can control the heap size, and tokenize my data, then transfer the tokenized data to eBay’s HDFS and run the actual LDA job on the cluster. Ugh.

First I have to get Hadoop running locally. I followed the DigitalOcean tutorial, which leaves out crucial details about permissions. To address this oversight is the entire point of this post (that and to exorcise my crappy-implementations-of-LDA-induced demons).

I installed everything to run from my user account, as opposed to root. Specifically, I untarred the hadoop directory using my user account, then moved it to /etc with sudo, so the entire hadoop directory structure is under my username and login group. Then, after creating the hadoop_store directory as root, I chowned that entire subtree over to my user account with ‘chown -R uname: /usr/local/hadoop_store’. Then I followed the directions as provided and got a working hadoop system.