I’m trying to run LDA (latent Dirichlet allocation) on a large amount of data. A quick recap of the attempts so far:
- Tried Vowpal Wabbit … it’s fast, I’ll give it that, but it’s also useless: the output is dubious (what I think are the topics look like they haven’t moved much from the prior) *and* I have no idea how that output maps onto topics and documents (the documentation is AWFUL, and the dimensions of the output files are WONKY).
- Tried two implementations of SCVB0, a stochastic collapsed variational Bayes LDA algorithm: one doesn’t work at all (as in, it stalls on any amount of data — so crap), and the other works fine and fast on medium amounts of data but apparently needs more than 5 days on a dataset my size (I quit after 5 days; it’s worth noting that this code was removed from GitHub literally the day after I downloaded it — so dubious).
- For whatever reason, Mallet can’t even import (using import-file) a relatively small amount of data: the import process takes forever and then quits with an error — so crap.
- Now I’m trying to use Mr.LDA, a map-reduce implementation. Given my experience so far, I am not holding out much hope, but when one must… The issue with Mr.LDA is that when I’m importing my data on the eBay Hadoop cluster, the mappers run out of heap memory during the tokenization process.
After spending most of today trying to figure out how to increase the heap memory, I decided to just install Hadoop locally, where I can control the heap size, tokenize my data there, then transfer the tokenized data to eBay’s HDFS and run the actual LDA job on the cluster. Ugh.
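For the record, the knob I spent the day trying to turn is the mapper JVM heap. Here is a minimal sketch of the per-job override I was attempting, with the jar and driver names left as placeholders, and assuming the driver accepts Hadoop’s generic -D options and the cluster doesn’t simply override them (which property applies depends on the Hadoop version):

```
# Hadoop 1.x reads mapred.child.java.opts; Hadoop 2.x / YARN reads
# mapreduce.map.java.opts (JVM heap) and mapreduce.map.memory.mb (container).
# <mr-lda.jar> and <tokenize-driver> are placeholders, not the real names.
hadoop jar <mr-lda.jar> <tokenize-driver> \
  -Dmapred.child.java.opts=-Xmx4096m \
  -Dmapreduce.map.java.opts=-Xmx4096m \
  -Dmapreduce.map.memory.mb=5120 \
  <the usual Mr.LDA input/output arguments>
```

I never managed to make any of this stick on the eBay cluster, hence the local detour.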
First I have to get Hadoop running locally. I followed the DigitalOcean tutorial, which leaves out crucial details about permissions; addressing that oversight is the entire point of this post (that, and exorcising my crappy-implementations-of-LDA-induced demons).
I installed everything to run from my user account, as opposed to root. Specifically, I untarred the Hadoop distribution as my own user, then moved it to /etc with sudo, so the entire Hadoop directory structure is owned by my username and login group. After creating the hadoop_store directory as root, I chowned that entire subtree over to my user account with ‘chown -R uname: /usr/local/hadoop_store’. From there I followed the directions as provided and ended up with a working Hadoop system.
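For anyone retracing this, the sequence looks roughly like the following; ‘uname’ stands in for an actual username, and the tarball version and exact destination path are illustrative:

```
# Untar as a regular user; moving it with sudo preserves that ownership.
tar -xzf hadoop-2.x.y.tar.gz
sudo mv hadoop-2.x.y /etc/hadoop

# Create the HDFS storage directory as root (the tutorial nests its
# namenode/datanode directories under it), then hand the whole subtree
# to the user; the trailing colon sets the group to the user's login group.
sudo mkdir -p /usr/local/hadoop_store
sudo chown -R uname: /usr/local/hadoop_store
```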