Mahout / MAHOUT-941

Improve ConfusionMatrix statistics

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8
    • Component/s: Classification
    • Labels:
      None

      Description

      This patch adds more statistics to the ConfusionMatrix and ResultAnalyzer.

      1. Add Kappa measure - a standard measure comparing the observed agreement against the agreement expected from random assignment.
      2. Add mean & standard deviation of "Reliability" (User Accuracy) - assists in identifying consistent misassignment against "good" and "bad" labels.
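
For readers unfamiliar with the Kappa measure, here is a minimal sketch of the standard unweighted (Cohen's) kappa computed from a square confusion matrix. It assumes rows are actual labels and columns are assigned labels, and it ignores any "unclassified" bucket. The class and method names are hypothetical and are not taken from the patch; the sample matrix is the two-label matrix posted in the comments on this issue.

```java
final class KappaSketch {
  private KappaSketch() {}

  /**
   * Unweighted Cohen's kappa: (observed - expected) / (1 - expected).
   * Assumes a square matrix with rows = actual label, columns = assigned label.
   * Returns NaN when expected agreement is exactly 1 (division by zero).
   */
  static double kappa(long[][] m) {
    int n = m.length;
    long[] rowSums = new long[n];
    long[] colSums = new long[n];
    long total = 0;
    long diagonal = 0;
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        rowSums[i] += m[i][j];
        colSums[j] += m[i][j];
        total += m[i][j];
      }
      diagonal += m[i][i];
    }
    // Observed agreement: fraction of items on the diagonal.
    double observed = (double) diagonal / total;
    // Expected agreement if labels were assigned at random with the same marginals.
    double expected = 0.0;
    for (int i = 0; i < n; i++) {
      expected += ((double) rowSums[i] / total) * ((double) colSums[i] / total);
    }
    return (observed - expected) / (1.0 - expected);
  }

  public static void main(String[] args) {
    // Two-label matrix from the comments (commons.apache.org vs. cocoon.apache.org).
    long[][] m = {{27615, 756}, {703, 40595}};
    System.out.println("kappa = " + kappa(m));
  }
}
```

Under these assumptions a classifier that is right about 98% of the time on two roughly balanced labels yields a kappa close to 1, so a negative value for such a matrix points to a calculation error rather than a poor classifier.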

      Attachments

      1. Bayes.zip (2 kB, Lance Norskog)
      2. MAHOUT-941.patch (13 kB, Lance Norskog)
      3. MAHOUT-941.patch (22 kB, Lance Norskog)
      4. MAHOUT-941.patch (14 kB, Lance Norskog)
      5. SGD.zip (2 kB, Lance Norskog)

        Issue Links

          Activity

          Lance Norskog created issue -
          Lance Norskog made changes -
          Field Original Value New Value
          Attachment MAHOUT-941.patch [ 12509509 ]
          Lance Norskog added a comment -

          Suggestion: leave out 'Success' if you commit. It is not a finished product. I was unable to cleanly remove it from the patch.

          Removing the quoted text was a serious win: oddly, SGD worked much better without quoted text and subjects. See the attached files Bayes.zip and SGD.zip for test runs. I worked against a sample of the Apache email archives; it's on the net somewhere but I can't find the link just now.

          Lance Norskog added a comment -

          These files contain the final output of 8 runs, one for each combination of:
          bayes vs. sgd
          quoted text in bodies vs. stripped
          subject line vs. no subject line

          Lance Norskog made changes -
          Attachment SGD.zip [ 12509510 ]
          Attachment Bayes.zip [ 12509511 ]
          Grant Ingersoll made changes -
          Fix Version/s 0.6 [ 12316364 ]
          Grant Ingersoll made changes -
          Assignee Grant Ingersoll [ gsingers ]
          Grant Ingersoll made changes -
          Link This issue is related to MAHOUT-939 [ MAHOUT-939 ]
          Grant Ingersoll added a comment -

          I will likely work on these two together, as I already have similar changes for MAHOUT-939 locally.

          Grant Ingersoll added a comment -

          Lance, can you separate out the stats piece into a different issue? I'll fold the quoted-text stuff in with MAHOUT-939, and then we can deal with the stats separately.

          Grant Ingersoll added a comment -

          Or just rename this issue so that it covers only the stats piece.

          Lance Norskog added a comment -

          Renamed this issue to focus on Confusion Matrix stats.
          The stripper for quoted lines has been added to MAHOUT-939 and removed from this patch.

          Lance Norskog made changes -
          Summary "Strip quoted text from emails and add statistics to ConfusionMatrix" -> "Improve ConfusionMatrix statistics"
          Grant Ingersoll made changes -
          Fix Version/s 0.7 [ 12319261 ]
          Fix Version/s 0.6 [ 12316364 ]
          Lance Norskog made changes -
          Description
          Original Value:
          This patch does 2 things:
          # Add a feature to org.apache.mahout.text.SequenceFilesFromMailArchives that removes quoted text from email bodies. This is important because it avoids spamming the term dictionary with repeated text, especially in long email threads.
          ** The feature defaults to true. Add "--quoted" to the command line to keep the quoted lines.

          # Adds some dubious overall measurements to the ConfusionMatrix.
          ** Kappa - a standard measurement.
          *** How different is this confusion matrix from random numbers? 0.0 is the same, 1.0 is completely different.
          *** I think this is an "unweighted" kappa.
          ** "Success" - a homegrown formula attempting to represent the correctness of each box. Probably bogus.
          *** The standard deviation shows the distance between the success of each producer->consumer box.
          New Value:
          This patch adds more statistics to the ConfusionMatrix and ResultAnalyzer.
          # Add Kappa measure - a standard measure comparing a sample vs. random assignment.
          # Add mean & standard deviation of individual labels - assists in identifying consistent misassignment vs. high and low quality labels.

          Also, the SGD solver saves its model periodically to /tmp/news-groups-number. This patch moves those captures to the model/ output directory. (These intermediate models are interesting for tracking SGD incremental development.)
          Lance Norskog added a comment -

          Removed the email processing additions; they moved to MAHOUT-939.

          Enhanced the statistics to assist in tuning classifiers. Added CSV output for graphing incremental SGD models.

          Lance Norskog made changes -
          Attachment MAHOUT-941.patch [ 12509872 ]
          Grant Ingersoll made changes -
          Fix Version/s 0.8 [ 12320153 ]
          Fix Version/s 0.7 [ 12319261 ]
          Robin Anil made changes -
          Assignee Grant Ingersoll [ gsingers ] Robin Anil [ robinanil ]
          Robin Anil added a comment -

          Complementary Results:
          =======================================================
          Summary
          -------------------------------------------------------
          Correctly Classified Instances : 68210 97.9058%
          Incorrectly Classified Instances : 1459 2.0942%
          Total Classified Instances : 69669

          =======================================================
          Confusion Matrix
          -------------------------------------------------------
          a b <--Classified as
          27615 756 | 28371 a = commons.apache.org
          703 40595 | 41298 b = cocoon.apache.org

          =======================================================
          Statistics
          -------------------------------------------------------
          Kappa : -1.1483
          Accuracy : 0.6522
          Consistency (stdev of accuracy) : 0.5052

          I am seeing this. Why is accuracy 0.65 when it's actually 0.979? Can you fix this issue?

          Lance Norskog added a comment -

          1) Grrrr... correct is supposed to be an accumulator (a running sum over the diagonal), but this line assigns instead of adding:

           correct = confusionMatrix[labelId][labelId];
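
A minimal sketch of what an accumulating version might look like, assuming confusionMatrix is a square int[][] indexed by label id on both axes. The class and method names are hypothetical; this illustrates the intent described above and is not the committed fix.

```java
final class DiagonalSumSketch {
  private DiagonalSumSketch() {}

  /** Sums the diagonal of a square confusion matrix: the number of correctly classified items. */
  static int countCorrect(int[][] confusionMatrix) {
    int correct = 0;
    for (int labelId = 0; labelId < confusionMatrix.length; labelId++) {
      // "+=" accumulates across labels; a plain "=" would keep only the last label's count.
      correct += confusionMatrix[labelId][labelId];
    }
    return correct;
  }

  public static void main(String[] args) {
    // Two-label matrix from the report above.
    int[][] m = {{27615, 756}, {703, 40595}};
    System.out.println("correct = " + countCorrect(m));
  }
}
```

For the two-label matrix in the report, the diagonal sum matches the "Correctly Classified Instances" line in the summary.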
          

          2) This is printed out wrong. The "accuracy" up above is "producer's accuracy". This code calculates that and "user's accuracy", or "reliability". These are different. The printout should show both accuracies, and possibly also the mean of the two.

          Imagine classification as the code throwing balls of different sizes to robot arms, each programmed to grab one size. If no arm grabs the ball, it is 'unclassified'. Producer's accuracy is from the thrower's point of view; user's accuracy is from the robot arms' points of view. The counts differ because 'unclassified' is part of the producer's 'wrong' count, while it is ignored by the user's counts.

          http://spatial-analyst.net/ILWIS/htm/ilwismen/confusion_matrix.htm

          Robin Anil added a comment -

          Lance, can you send the patch in?

          Lance Norskog added a comment - - edited

          Prints Accuracy and Reliability stats, plus standard deviation of reliability.

          Accuracy = "Producer Accuracy", includes unclassified results.
          Reliability = "User Accuracy", does not include unclassified results.
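
The distinction can be sketched per label as follows. This is an illustration under stated assumptions, not the patch's code: the matrix has one row per actual label and one extra trailing column for "unclassified", producer's accuracy divides the diagonal by the full row total, and user's accuracy divides it by the column total. All names are hypothetical, and the standard deviation shown is the population form, which may differ from what the patch uses.

```java
final class LabelAccuracySketch {
  private LabelAccuracySketch() {}

  /** Producer's accuracy per label: diagonal count divided by the full row total. */
  static double[] producerAccuracy(int[][] m) {
    double[] result = new double[m.length];
    for (int i = 0; i < m.length; i++) {
      long rowTotal = 0;
      for (int count : m[i]) {
        rowTotal += count; // includes the trailing "unclassified" column
      }
      result[i] = (double) m[i][i] / rowTotal; // NaN if the row is empty
    }
    return result;
  }

  /** User's accuracy (reliability) per label: diagonal count divided by the column total. */
  static double[] userAccuracy(int[][] m) {
    double[] result = new double[m.length];
    for (int j = 0; j < m.length; j++) {
      long columnTotal = 0;
      for (int[] row : m) {
        columnTotal += row[j]; // only items actually assigned to label j
      }
      result[j] = (double) m[j][j] / columnTotal; // NaN if nothing was assigned to j
    }
    return result;
  }

  static double mean(double[] values) {
    double sum = 0.0;
    for (double v : values) {
      sum += v;
    }
    return sum / values.length;
  }

  /** Population standard deviation (divides by n, not n - 1). */
  static double stdDev(double[] values) {
    double mu = mean(values);
    double squares = 0.0;
    for (double v : values) {
      squares += (v - mu) * (v - mu);
    }
    return Math.sqrt(squares / values.length);
  }

  public static void main(String[] args) {
    // Made-up counts: two labels plus a trailing "unclassified" column.
    int[][] m = {{8, 1, 1}, {2, 6, 2}};
    double[] user = userAccuracy(m);
    System.out.println("mean producer's accuracy = " + mean(producerAccuracy(m)));
    System.out.println("mean user's accuracy     = " + mean(user));
    System.out.println("stddev of user's accuracy = " + stdDev(user));
  }
}
```

In the made-up matrix, the second label loses two items to "unclassified": they lower its producer's accuracy but leave its user's accuracy untouched, which is the asymmetry described above.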

          Lance Norskog made changes -
          Attachment MAHOUT-941.patch [ 12531222 ]
          Lance Norskog made changes -
          Description
          Original Value:
          This patch adds more statistics to the ConfusionMatrix and ResultAnalyzer.
          # Add Kappa measure - a standard measure comparing a sample vs. random assignment.
          # Add mean & standard deviation of individual labels - assists in identifying consistent misassignment vs. high and low quality labels.

          Also, the SGD solver saves its model periodically to /tmp/news-groups-number. This patch moves those captures to the model/ output directory. (These intermediate models are interesting for tracking SGD incremental development.)
          New Value:
          This patch adds more statistics to the ConfusionMatrix and ResultAnalyzer.
          # Add Kappa measure - a standard measure comparing a sample vs. random assignment.
          # Add mean & standard deviation of "Reliability" (User Accuracy) - assists in identifying consistent misassignment against "good" and "bad" labels.

          Lance Norskog added a comment -

          This is the output of classify-20newsgroups.sh. "Accuracy" is 90.4 percent. "Reliability" is 85.8 percent. The standard deviation of "reliability" is 0.22. "Kappa" is 0.88; it relates "accuracy" to "random classification". I do not know whether Kappa includes "unclassified" in its formula or assumes all items are classified to known labels. Perhaps it should be calculated both ways?

          Summary
          -------------------------------------------------------
          Correctly Classified Instances : 6788 90.4102%
          Incorrectly Classified Instances : 720 9.5898%
          Total Classified Instances : 7508

          =======================================================
          Confusion Matrix
          -------------------------------------------------------
          a b c d e f g h i j k l m n o p q r s t <--Classified as
          296 0 0 0 0 0 0 0 0 0 0 0 0 0 1 8 0 2 7 3 | 317 a = alt.atheism
          1 327 4 20 6 14 2 1 0 0 0 1 5 3 1 0 1 0 0 0 | 386 b = comp.graphics
          0 27 217 76 21 17 5 0 0 0 0 4 8 1 1 0 0 0 1 3 | 381 c = comp.os.ms-windows.misc
          0 10 1 315 23 3 9 2 0 0 0 0 8 0 0 0 0 0 0 0 | 371 d = comp.sys.ibm.pc.hardware
          0 5 1 9 348 0 5 1 0 0 0 0 4 0 0 0 0 0 1 1 | 375 e = comp.sys.mac.hardware
          0 23 2 7 1 328 1 0 0 0 0 1 0 1 1 0 0 0 0 0 | 365 f = comp.windows.x
          0 5 0 19 11 0 337 8 2 1 4 4 5 0 3 0 0 0 0 1 | 400 g = misc.forsale
          0 0 0 3 3 1 8 402 2 1 0 0 3 1 0 0 0 0 0 3 | 427 h = rec.autos
          0 0 0 0 0 1 7 5 368 0 0 0 0 1 0 0 0 1 0 1 | 384 i = rec.motorcycles
          1 0 0 0 0 0 1 1 0 379 7 0 0 1 0 0 0 0 0 0 | 390 j = rec.sport.baseball
          0 0 0 1 2 0 0 1 0 4 387 0 0 0 0 1 0 0 0 2 | 398 k = rec.sport.hockey
          0 3 0 1 3 2 0 0 0 0 0 393 2 0 0 0 1 3 1 2 | 411 l = sci.crypt
          0 5 0 12 10 0 5 1 1 0 0 1 328 0 2 0 0 2 1 0 | 368 m = sci.electronics
          1 5 1 3 1 1 1 0 0 0 0 0 2 377 4 0 0 0 1 4 | 401 n = sci.med
          0 5 0 0 1 1 1 0 0 1 0 2 0 1 389 0 0 0 2 2 | 405 o = sci.space
          4 2 0 1 2 0 0 1 0 1 1 0 0 1 0 397 2 2 5 1 | 420 p = soc.religion.christian
          1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 4 359 0 0 1 | 367 q = talk.politics.mideast
          0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 360 0 8 | 371 r = talk.politics.guns
          26 1 0 1 0 0 1 1 1 1 0 0 1 0 2 18 1 4 197 7 | 262 s = talk.religion.misc
          0 0 0 0 1 0 0 1 0 2 0 2 0 0 3 0 3 10 3 284 | 309 t = talk.politics.misc

          =======================================================
          Statistics
          -------------------------------------------------------
          Kappa 0.8759
          Accuracy 90.4102%
          Reliability 85.8359%
          Reliability (standard deviation) 0.2183

          Robin Anil made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Robin Anil made changes -
          Status In Progress [ 3 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hudson added a comment -

          Integrated in Mahout-Quality #2026 (See https://builds.apache.org/job/Mahout-Quality/2026/)
          MAHOUT-941 new confusion matrix statistics (Revision 1488595)

          Result = SUCCESS
          robinanil :
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/ConfusionMatrix.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/ResultAnalyzer.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/ConfusionMatrixTest.java
          Suneel Marthi made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Robin Anil
              Reporter:
              Lance Norskog
            • Votes:
              0
              Watchers:
              5
