Apache Mahout is a “scalable machine learning library” which, among others, contains implementations of various single-node and distributed recommendation algorithms. In my last blog post, I described how to implement an on-line recommender system processing data on a single node. What if the data is too large to fit into memory (>100M preference data points)? Then we have no choice but to take a look at Mahout’s distributed recommenders implementation!
The distributed recommender is based on Apache Hadoop; it’s a job which takes as input a list of user preferences, computes an item co-occurence matrix, and outputs top-K recommendations for each user. For an introductory blog on how this works and how to run it locally, see for example this blog post.
We can, of course, run this job on a custom Hadoop cluster, but it’s much faster (and less painful) to just use a pre-configured one, like EMR. However, there’s a slight problem. The latest Hadoop version that is available on EMR is 1.0.3, and it contains jars for Apache Lucene 2.9.4. However, the recommender job depends on Lucene 4.3.0, which results in the following beautiful stack trace: