> Why not have a bundle artifact where all the Mahout submodules would be put in a single jar?
How is this not trivial for you to handle with Maven?
If you are writing your own maven project (recommended), then jar-with-dependencies will do what you want.
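For reference, the jar-with-dependencies descriptor is enabled through the Maven assembly plugin, roughly along these lines (a sketch; the mainClass value is a placeholder for your own entry point, and plugin placement/version are left to your pom.xml):

```xml
<!-- Sketch of the standard jar-with-dependencies setup in pom.xml;
     the mainClass value is a placeholder, not a real class name. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>your.main.Class</mainClass>
      </manifest>
    </archive>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>single</goal></goals>
    </execution>
  </executions>
</plugin>
```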
If you are extending Mahout (ok for prototypes), just put your code in the examples job jar and all will be good.
I am not extending Mahout, and as you've probably seen in the comments above, the point is to be able to generate Mahout data structures from Behemoth, so putting the code in the examples module is not an option anyway.
Back to the original problem. I generate a job file for my Mahout module in Behemoth (https://github.com/jnioche/behemoth/tree/master/modules/mahout) and manage the dependencies with Ivy. The main class (SparseVectorsFromBehemoth) is a slightly modified version of SparseVectorsFromSequenceFiles which gets the tokens from Behemoth documents instead of using Lucene, and generates the data structures expected by the classifiers and clusterers.
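For context, the Mahout dependencies can be declared in ivy.xml along these lines (a sketch only; the conf mapping is an assumption and should match your own Ivy configurations):

```xml
<!-- Sketch of ivy.xml entries for the Mahout module; the conf
     attribute is an assumption about the local Ivy setup. -->
<dependencies>
  <dependency org="org.apache.mahout" name="mahout-core" rev="0.4" conf="default"/>
  <dependency org="org.apache.mahout" name="mahout-math" rev="0.4" conf="default"/>
</dependencies>
```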
The job file contains:
- the Behemoth classes for the Mahout module
- the dependencies in /lib including
The problem I had was the same as Han Hui Wen's (MAHOUT-368), i.e. I was getting a ClassNotFoundException on org.apache.mahout.math.VectorWritable. My understanding of the problem is that my main class calls DictionaryVectorizer, which in my job file sits in lib/mahout-core-0.4.jar and depends on VectorWritable, which is in lib/mahout-math-0.4.jar. For some reason MapReduce was not able to find VectorWritable, which I assume has to do with the jobs in DictionaryVectorizer calling 'job.setJarByClass(DictionaryVectorizer.class)'.
I could of course use jar-with-dependencies on the Mahout code to generate a single jar, then manage that jar locally. However, this means I have very little control over the dependencies used by Mahout (e.g. potentially conflicting versions with other components in my job files), and I'd rather rely on externally published jars anyway. A better option would be to simply unpack the content of the Mahout core and math jars into the root of my job file. At least the Mahout dependencies would be handled and versioned normally.
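In an Ant/Ivy build like Behemoth's, the unpack-into-root option could look something like this (a sketch only; the target name, property names, and jar filenames are assumptions about the build layout):

```xml
<!-- Sketch of an Ant target that unpacks the Mahout jars into the job
     root instead of nesting them under lib/; all paths are assumed. -->
<target name="job" depends="compile">
  <jar destfile="${build.dir}/behemoth-mahout.job">
    <fileset dir="${classes.dir}"/>
    <!-- unpack mahout-core and mahout-math into the root -->
    <zipfileset src="${lib.dir}/mahout-core-0.4.jar"/>
    <zipfileset src="${lib.dir}/mahout-math-0.4.jar"/>
    <!-- keep the remaining dependencies nested under lib/ -->
    <zipfileset dir="${lib.dir}" prefix="lib" excludes="mahout-*.jar"/>
  </jar>
</target>
```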
I've tried with Hadoop 0.21.0 and did not get this issue, so I suppose something must have changed in the way the classloader handles dependencies within a job file.