I'm getting ClassNotFoundException errors during Hadoop's map phase: the task cannot find my class org.apache.hadoop.chukwa.extraction.demux.processor.mapper.XmlBasedDemux, which I've packaged in a JAR named data-collection-demux-0.1.jar.
The problem seems to be in the values of these two properties in the Hadoop job configuration:
The root cause appears to be that the call to DistributedCache.addFileToClassPath is passed a Path in URI form (hdfs://localhost:9000/chukwa/demux/data-collection-demux-0.1.jar), whereas the DistributedCache API expects a filesystem-based path (/chukwa/demux/data-collection-demux-0.1.jar). I'm not sure why, but the FileStatus objects returned by FileSystem.listStatus carry a URI-based path instead of a filesystem-based path.
I kludged the addParsers method in the Demux class to strip the "hdfs://localhost:9000" portion of the string, and now my class is found. I will attempt to provide a patch today that reads the value of Hadoop's fs.default.name and strips that prefix from the path returned in Demux.java.
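For illustration, here is a minimal sketch of the two stripping approaches described above, written against plain java.net.URI so it runs without Hadoop on the classpath. The class ClasspathPathUtil and both method names are hypothetical and are not part of Chukwa or Hadoop; this is not the actual patch. Inside Demux the same effect could probably be had with Path.toUri().getPath(), but I have not verified that against the Demux code.

```java
import java.net.URI;

/** Hypothetical helper for illustration only; not part of Chukwa or Hadoop. */
final class ClasspathPathUtil {
    private ClasspathPathUtil() {}

    /**
     * Drops the scheme and authority (e.g. "hdfs://localhost:9000") and
     * keeps only the path component. A string that is already a plain
     * path is returned unchanged. Assumes the input is a syntactically
     * valid URI (no unescaped spaces, for example).
     */
    static String stripSchemeAndAuthority(String location) {
        return URI.create(location).getPath();
    }

    /**
     * Strips the configured default filesystem prefix, as would be read
     * from fs.default.name, if the location starts with it. Locations on
     * other filesystems are returned unchanged.
     */
    static String stripDefaultFs(String location, String defaultFs) {
        // Tolerate a trailing slash on the configured value.
        String prefix = defaultFs.endsWith("/")
                ? defaultFs.substring(0, defaultFs.length() - 1)
                : defaultFs;
        if (!prefix.isEmpty() && location.startsWith(prefix + "/")) {
            return location.substring(prefix.length());
        }
        return location;
    }
}
```

The URI-based variant does not need to know fs.default.name at all, which may make it the simpler fix; the prefix-based variant matches the patch I described and leaves paths on any other filesystem untouched.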