Mahout / MAHOUT-1006

Example from book no longer works - prepare20newsgroups broken with Lucene upgrade

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.7
    • Component/s: None
    • Labels:
      None

      Description

      The StandardAnalyzer from Lucene no longer has a no-args constructor. Our code uses reflection to create this class, but looks for a no-args constructor and that causes this:

      ./bin/mahout prepare20newsgroups -p 20news-bydate-train/ -o 20news-train/ -a org.apache.lucene.analysis.standard.StandardAnalyzer -c UTF-8  
      MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
      no HADOOP_HOME set, running locally
      Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
      SLF4J: Class path contains multiple SLF4J bindings.
      SLF4J: Found binding in [jar:file:/Users/hadoop/mahout/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/Users/hadoop/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/Users/hadoop/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      Exception in thread "main" java.lang.IllegalStateException: java.lang.NoSuchMethodException: org.apache.lucene.analysis.standard.StandardAnalyzer.<init>()
      	at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:68)
      	at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:28)
      	at org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups.main(PrepareTwentyNewsgroups.java:89)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	at java.lang.reflect.Method.invoke(Method.java:597)
      	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
      	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
      	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
      Caused by: java.lang.NoSuchMethodException: org.apache.lucene.analysis.standard.StandardAnalyzer.<init>()
      	at java.lang.Class.getConstructor0(Class.java:2706)
      	at java.lang.Class.getConstructor(Class.java:1657)
      	at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:62)
      	... 9 more
      

      This is really bad.
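The failure mode described above can be reproduced without Lucene. The following is a minimal sketch, not the patch attached to this issue: `VersionedAnalyzer` is a hypothetical stand-in for a class whose only constructor takes an argument, and `instantiateWithFallback` is an illustrative helper name rather than a Mahout API. It shows a no-args reflective lookup failing with `NoSuchMethodException`, and one possible fallback that retries with a single-argument constructor.

```java
import java.lang.reflect.Constructor;

/** Hypothetical stand-in for a class that no longer has a no-args constructor. */
class VersionedAnalyzer {
  final String version;

  VersionedAnalyzer(String version) {
    this.version = version;
  }
}

final class ReflectiveInstantiation {

  /** Mirrors the failing pattern: look up a no-args constructor and wrap any failure. */
  static <T> T instantiateNoArgs(Class<? extends T> clazz) {
    try {
      Constructor<? extends T> ctor = clazz.getDeclaredConstructor();
      return ctor.newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Tries the no-args constructor first, then falls back to a single String-argument one. */
  static <T> T instantiateWithFallback(Class<? extends T> clazz, String arg) {
    try {
      return clazz.getDeclaredConstructor().newInstance();
    } catch (NoSuchMethodException e) {
      try {
        return clazz.getDeclaredConstructor(String.class).newInstance(arg);
      } catch (ReflectiveOperationException inner) {
        throw new IllegalStateException(inner);
      }
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    try {
      instantiateNoArgs(VersionedAnalyzer.class);
    } catch (IllegalStateException e) {
      // The cause is a NoSuchMethodException, as in the stack trace above.
      System.out.println("no-args lookup failed: " + e.getCause());
    }
    VersionedAnalyzer a =
        instantiateWithFallback(VersionedAnalyzer.class, "hypothetical-version");
    System.out.println("fallback succeeded with version " + a.version);
  }
}
```

Whether a fallback like this is appropriate depends on what argument the real constructor expects; the sketch only illustrates why the no-args lookup throws.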

        Activity

        Ted Dunning added a comment -

        Here is an (untested) patch.

        Ted Dunning added a comment -

        Untested patch. I will ask the original user to test.

        Ted Dunning added a comment -

        Can somebody test and commit this? I am on the road and have limited access.

        Grant Ingersoll added a comment -

        I've got this one, Ted.

        Grant Ingersoll added a comment -

        Hmm, looks like removing the old bayes code was a bit too aggressive.

        Grant Ingersoll added a comment -

        Robin is fixing some other things w/ NB, so he's going to take this.

        Hudson added a comment -

        Integrated in Mahout-Quality #1517 (See https://builds.apache.org/job/Mahout-Quality/1517/)
        MAHOUT-1006 making end to end example work (Revision 1345735)

        Result = FAILURE
        robinanil : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345735
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/AbstractNaiveBayesClassifier.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/BayesUtils.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/ComplementaryNaiveBayesClassifier.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/test/BayesTestMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/test/TestNaiveBayesDriver.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/IndexInstancesMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/ThetaMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/TrainNaiveBayesJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/WeightsMapper.java
        Hudson added a comment -

        Integrated in Mahout-Quality #1519 (See https://builds.apache.org/job/Mahout-Quality/1519/)
        MAHOUT-1006 Increase default heapsize to 4G and create deprecation warnings for old naivebayes (Revision 1345772)

        Result = FAILURE
        robinanil : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345772
        Files :

        • /mahout/trunk/bin/mahout
        • /mahout/trunk/core/src/main/java/org/apache/mahout/driver/MahoutDriver.java
        Robin Anil added a comment -

        Phew! After weeks of re-reading the code, I finally figured out that the theta normalization driver is screwed up. The current code is somehow different, and right now I do not have a solution for fixing that. However, theta normalization only changes accuracy by some 1-2%, so I would not be too worried about that. The solution will only work with seq2sparse and tfidf vectors. It assumes that the vectors in the input sequence file are named in the format "/class-name/filename", and it expects this format however it is used. Sorry, I don't have a better representation for class name and vector name; there was only a short amount of time to make this a working replacement for bayes for this release. I have gone ahead and added deprecation messages for people who try to run prepare20newsgroups via the command line.

        So for an 80-20 random split, the classifier gives 91% accuracy on the 20newsgroups data. The example is added to classify-20newsgroups.sh.

        If I get more time this week during Buzzwords, I will try to fix the issue with the theta normalizer as well. But this should be releasable even if I am not able to do that in time.
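To make the "/class-name/filename" naming assumption concrete, here is a small sketch of how a class label could be recovered from such a name. It is an illustration only, not the code committed for this issue: `LabelFromKey` and `labelOf` are hypothetical names, and the file name `49960` in the usage line is made up.

```java
final class LabelFromKey {

  /** Extracts "class-name" from a name of the form "/class-name/filename". */
  static String labelOf(String key) {
    if (key == null || !key.startsWith("/")) {
      throw new IllegalArgumentException("expected /class-name/filename but got: " + key);
    }
    // Position of the slash that separates the class name from the file name.
    int end = key.indexOf('/', 1);
    if (end <= 1 || end == key.length() - 1) {
      throw new IllegalArgumentException("expected /class-name/filename but got: " + key);
    }
    return key.substring(1, end);
  }

  public static void main(String[] args) {
    // Hypothetical vector name; prints "alt.atheism".
    System.out.println(labelOf("/alt.atheism/49960"));
  }
}
```

Rejecting names that do not match the format, rather than guessing a label, makes a mis-prepared input fail early instead of silently training on wrong classes.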

         
        
        Standard NB Results: =======================================================
        Summary
        -------------------------------------------------------
        Correctly Classified Instances          :       3357	   91.0991%
        Incorrectly Classified Instances        :        328	    8.9009%
        Total Classified Instances              :       3685
        
        =======================================================
        Confusion Matrix
        -------------------------------------------------------
        a    	b    	c    	d    	e    	f    	g    	h    	i    	j    	k    	l    	m    	n    	o    	p    	q    	r    	s    	t    	<--Classified as
        159  	0    	0    	0    	1    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	1    	0    	0    	4    	0    	 |  165   	a     = alt.atheism
        0    	155  	0    	8    	3    	6    	3    	0    	0    	0    	0    	1    	2    	0    	2    	0    	0    	0    	0    	0    	 |  180   	b     = comp.graphics
        0    	26   	104  	36   	7    	6    	1    	0    	0    	0    	0    	0    	1    	0    	2    	0    	0    	0    	0    	0    	 |  183   	c     = comp.os.ms-windows.misc
        0    	4    	2    	139  	11   	0    	5    	0    	0    	0    	0    	1    	2    	0    	0    	0    	0    	0    	0    	0    	 |  164   	d     = comp.sys.ibm.pc.hardware
        0    	2    	1    	2    	165  	0    	3    	0    	0    	0    	0    	1    	3    	0    	0    	0    	0    	0    	0    	0    	 |  177   	e     = comp.sys.mac.hardware
        1    	13   	0    	5    	2    	175  	3    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	 |  199   	f     = comp.windows.x
        0    	2    	0    	5    	0    	0    	168  	3    	0    	2    	2    	0    	1    	0    	0    	0    	0    	0    	1    	0    	 |  184   	g     = misc.forsale
        0    	0    	0    	1    	2    	0    	2    	182  	3    	0    	0    	0    	4    	0    	0    	1    	0    	0    	0    	0    	 |  195   	h     = rec.autos
        0    	0    	0    	0    	0    	1    	5    	2    	199  	0    	0    	0    	1    	0    	0    	0    	0    	0    	0    	0    	 |  208   	i     = rec.motorcycles
        0    	0    	0    	0    	0    	0    	1    	0    	0    	177  	1    	0    	0    	0    	0    	0    	0    	0    	0    	0    	 |  179   	j     = rec.sport.baseball
        0    	0    	0    	1    	0    	0    	0    	0    	0    	0    	183  	0    	0    	0    	0    	0    	0    	0    	0    	1    	 |  185   	k     = rec.sport.hockey
        0    	1    	0    	0    	0    	3    	0    	1    	0    	1    	0    	193  	0    	2    	0    	0    	0    	1    	1    	2    	 |  205   	l     = sci.crypt
        0    	3    	0    	9    	4    	2    	3    	1    	0    	0    	1    	2    	171  	0    	0    	0    	0    	0    	0    	0    	 |  196   	m     = sci.electronics
        0    	2    	1    	1    	0    	0    	1    	0    	0    	0    	0    	0    	1    	190  	2    	0    	0    	0    	0    	0    	 |  198   	n     = sci.med
        0    	3    	0    	0    	0    	1    	0    	0    	0    	0    	0    	0    	2    	0    	190  	0    	0    	0    	2    	1    	 |  199   	o     = sci.space
        4    	1    	0    	1    	1    	0    	0    	0    	0    	0    	1    	0    	0    	1    	0    	212  	0    	0    	1    	0    	 |  222   	p     = soc.religion.christian
        0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	170  	1    	0    	1    	 |  172   	q     = talk.politics.mideast
        0    	0    	1    	0    	0    	0    	1    	0    	1    	0    	0    	2    	0    	0    	0    	0    	0    	165  	0    	5    	 |  175   	r     = talk.politics.guns
        14   	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	8    	0    	3    	116  	5    	 |  146   	s     = talk.religion.misc
        0    	0    	0    	0    	0    	0    	0    	0    	0    	2    	0    	0    	1    	0    	0    	0    	2    	3    	1    	144  	 |  153   	t     = talk.politics.misc
        
        Complementary Results: =======================================================
        Summary
        -------------------------------------------------------
        Correctly Classified Instances          :       3357	   91.0991%
        Incorrectly Classified Instances        :        328	    8.9009%
        Total Classified Instances              :       3685
        
        =======================================================
        Confusion Matrix
        -------------------------------------------------------
        a    	b    	c    	d    	e    	f    	g    	h    	i    	j    	k    	l    	m    	n    	o    	p    	q    	r    	s    	t    	<--Classified as
        159  	0    	0    	0    	1    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	1    	0    	0    	4    	0    	 |  165   	a     = alt.atheism
        0    	155  	0    	8    	3    	6    	3    	0    	0    	0    	0    	1    	2    	0    	2    	0    	0    	0    	0    	0    	 |  180   	b     = comp.graphics
        0    	26   	104  	36   	7    	6    	1    	0    	0    	0    	0    	0    	1    	0    	2    	0    	0    	0    	0    	0    	 |  183   	c     = comp.os.ms-windows.misc
        0    	4    	2    	139  	11   	0    	5    	0    	0    	0    	0    	1    	2    	0    	0    	0    	0    	0    	0    	0    	 |  164   	d     = comp.sys.ibm.pc.hardware
        0    	2    	1    	2    	165  	0    	3    	0    	0    	0    	0    	1    	3    	0    	0    	0    	0    	0    	0    	0    	 |  177   	e     = comp.sys.mac.hardware
        1    	13   	0    	5    	2    	175  	3    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	 |  199   	f     = comp.windows.x
        0    	2    	0    	5    	0    	0    	168  	3    	0    	2    	2    	0    	1    	0    	0    	0    	0    	0    	1    	0    	 |  184   	g     = misc.forsale
        0    	0    	0    	1    	2    	0    	2    	182  	3    	0    	0    	0    	4    	0    	0    	1    	0    	0    	0    	0    	 |  195   	h     = rec.autos
        0    	0    	0    	0    	0    	1    	5    	2    	199  	0    	0    	0    	1    	0    	0    	0    	0    	0    	0    	0    	 |  208   	i     = rec.motorcycles
        0    	0    	0    	0    	0    	0    	1    	0    	0    	177  	1    	0    	0    	0    	0    	0    	0    	0    	0    	0    	 |  179   	j     = rec.sport.baseball
        0    	0    	0    	1    	0    	0    	0    	0    	0    	0    	183  	0    	0    	0    	0    	0    	0    	0    	0    	1    	 |  185   	k     = rec.sport.hockey
        0    	1    	0    	0    	0    	3    	0    	1    	0    	1    	0    	193  	0    	2    	0    	0    	0    	1    	1    	2    	 |  205   	l     = sci.crypt
        0    	3    	0    	9    	4    	2    	3    	1    	0    	0    	1    	2    	171  	0    	0    	0    	0    	0    	0    	0    	 |  196   	m     = sci.electronics
        0    	2    	1    	1    	0    	0    	1    	0    	0    	0    	0    	0    	1    	190  	2    	0    	0    	0    	0    	0    	 |  198   	n     = sci.med
        0    	3    	0    	0    	0    	1    	0    	0    	0    	0    	0    	0    	2    	0    	190  	0    	0    	0    	2    	1    	 |  199   	o     = sci.space
        4    	1    	0    	1    	1    	0    	0    	0    	0    	0    	1    	0    	0    	1    	0    	212  	0    	0    	1    	0    	 |  222   	p     = soc.religion.christian
        0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	170  	1    	0    	1    	 |  172   	q     = talk.politics.mideast
        0    	0    	1    	0    	0    	0    	1    	0    	1    	0    	0    	2    	0    	0    	0    	0    	0    	165  	0    	5    	 |  175   	r     = talk.politics.guns
        14   	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	8    	0    	3    	116  	5    	 |  146   	s     = talk.religion.misc
        0    	0    	0    	0    	0    	0    	0    	0    	0    	2    	0    	0    	1    	0    	0    	0    	2    	3    	1    	144  	 |  153   	t     = talk.politics.misc
        
        
        
        
        Hudson added a comment -

        Integrated in Mahout-Quality #1521 (See https://builds.apache.org/job/Mahout-Quality/1521/)
        MAHOUT-1006 Example of 20newsgroups using new naivebayes package, gets 91% accuracy for 20% random split of the dataset (Revision 1345807)

        Result = FAILURE
        robinanil : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345807
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/AbstractNaiveBayesClassifier.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/ComplementaryNaiveBayesClassifier.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/AbstractThetaTrainer.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/ComplementaryThetaTrainer.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/StandardThetaTrainer.java
        • /mahout/trunk/examples/bin/classify-20newsgroups.sh
        • /mahout/trunk/src/conf/driver.classes.props
        Hudson added a comment -

        Integrated in Mahout-Quality #1522 (See https://builds.apache.org/job/Mahout-Quality/1522/)
        MAHOUT-1006 Final changes, fixes some flag issues and adds an option in example script to run classifier in cnaivebayes and naivebayes mode (Revision 1345814)

        Result = FAILURE
        robinanil : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345814
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/test/TestNaiveBayesDriver.java
        • /mahout/trunk/examples/bin/classify-20newsgroups.sh
        Hudson added a comment -

        Integrated in Mahout-Quality #1523 (See https://builds.apache.org/job/Mahout-Quality/1523/)
        MAHOUT-1006 Fixes test to use new format, disabled theta training phase for now. Some code cleanup (Revision 1345821)

        Result = SUCCESS
        robinanil : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345821
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/AbstractNaiveBayesClassifier.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/BayesUtils.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/ComplementaryNaiveBayesClassifier.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/NaiveBayesModel.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/StandardNaiveBayesClassifier.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/ComplementaryThetaTrainer.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/StandardThetaTrainer.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/TrainNaiveBayesJob.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/naivebayes/NaiveBayesTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/naivebayes/training/IndexInstancesMapperTest.java
        • /mahout/trunk/examples/bin/classify-20newsgroups.sh
        Robin Anil added a comment -

        Ran classification on the ASF mail archives example. From creating vectors to training and testing, it took 15 minutes on my 2011 MacBook.
        Creating vectors with seq2sparse took most of that, about 13 minutes.
        NaiveBayes train took 105 seconds.
        NaiveBayes test took 18 seconds.

        12/06/04 17:39:59 INFO test.TestNaiveBayesDriver: Complementary Results: =======================================================
        Summary
        -------------------------------------------------------
        Correctly Classified Instances : 68443 96.7488%
        Incorrectly Classified Instances : 2300 3.2512%
        Total Classified Instances : 70743

        =======================================================
        Confusion Matrix
        -------------------------------------------------------
        a b <--Classified as
        41625 653 | 42278 a = cocoon.apache.org
        1647 26818 | 28465 b = commons.apache.org
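The summary figures can be checked against the confusion matrix: correctly classified instances are the diagonal entries, and the total is the sum of all cells. The sketch below does that arithmetic for the two-class matrix above, reading each row as the actual class and each column as the predicted class. The class and method names are illustrative and are not part of Mahout.

```java
import java.util.Locale;

final class ConfusionMatrixAccuracy {

  /** Sum of the diagonal: instances whose predicted class equals the actual class. */
  static long correct(long[][] m) {
    long sum = 0;
    for (int i = 0; i < m.length; i++) {
      sum += m[i][i];
    }
    return sum;
  }

  /** Sum of every cell: all classified instances. */
  static long total(long[][] m) {
    long sum = 0;
    for (long[] row : m) {
      for (long cell : row) {
        sum += cell;
      }
    }
    return sum;
  }

  static double accuracyPercent(long[][] m) {
    return 100.0 * correct(m) / total(m);
  }

  public static void main(String[] args) {
    // Rows: actual cocoon.apache.org, commons.apache.org; columns: classified as a, b.
    long[][] asfMail = {
      {41625, 653},
      {1647, 26818},
    };
    System.out.printf(Locale.ROOT, "%d of %d correct, %.4f%%%n",
        correct(asfMail), total(asfMail), accuracyPercent(asfMail));
  }
}
```

For this matrix the diagonal sums to 68443 out of 70743 instances, which matches the 96.7488% reported in the summary.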

        Robin Anil added a comment -

        This is the output of the 20% split test using Ted's encoder:
        encoding takes 200 seconds
        train 222s
        test 117s

        12/06/04 18:18:49 INFO test.TestNaiveBayesDriver: Complementary Results: =======================================================
        Summary
        -------------------------------------------------------
        Correctly Classified Instances : 68302 97.8342%
        Incorrectly Classified Instances : 1512 2.1658%
        Total Classified Instances : 69814

        =======================================================
        Confusion Matrix
        -------------------------------------------------------
        a b <--Classified as
        27633 796 | 28429 a = commons.apache.org
        716 40669 | 41385 b = cocoon.apache.org

        Hudson added a comment -

        Integrated in Mahout-Quality #1526 (See https://builds.apache.org/job/Mahout-Quality/1526/)
        MAHOUT-1006 Fixes to run asf classification examples on naivebayes (Revision 1346021)

        Result = SUCCESS
        robinanil : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1346021
        Files :

        • /mahout/trunk/bin/mahout
        • /mahout/trunk/examples/bin/asf-email-examples.sh
        Hudson added a comment -

        Integrated in Mahout-Quality #1527 (See https://builds.apache.org/job/Mahout-Quality/1527/)
        MAHOUT-1006 Fixes to run asf classification examples on naivebayes using encoder (Revision 1346031)

        Result = SUCCESS
        robinanil : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1346031
        Files :

        • /mahout/trunk/examples/bin/asf-email-examples.sh
        Andrii Vozniuk added a comment -

        The example doesn't work for me with Mahout 0.7. Here is what I get:

        mahout prepare20newsgroups -p 20news-bydate-train/ -o 20news-train -a org.apache.lucene.analysis.standard.StandardAnalyzer -c UTF-8
        MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
        Error: Could not find or load main class classpath
        MAHOUT_LOCAL is set, running locally
        SLF4J: Class path contains multiple SLF4J bindings.
        SLF4J: Found binding in [jar:file:/home/tr/code/mahout-src-0.7/examples/target/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
        SLF4J: Found binding in [jar:file:/home/tr/code/mahout-src-0.7/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
        SLF4J: Found binding in [jar:file:/home/tr/code/mahout-src-0.7/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
        SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
        12/08/07 18:45:51 ERROR driver.MahoutDriver: : Try the new vector backed naivebayes classifier see examples/bin/classify-20newsgroups.sh

        The script examples/bin/classify-20newsgroups.sh works well. Do you plan to make prepare20newsgroups work as in the example from the book, or is it now deprecated?

        Robin Anil added a comment -

        Prepare 20 newsgroups is deprecated. The new classifier is much faster and works on any vectors (unlike the earlier text-based classifier).


          People

          • Assignee:
            Robin Anil
            Reporter:
            Ted Dunning
          • Votes:
            0
            Watchers:
            4
