Mahout
  1. Mahout
  2. MAHOUT-838

Make the confusion matrix writable to a file when testing classifiers

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels:
      None

      Description

      If you have a lot of labels for a classifier, the confusion matrix is hard to fit in terminal window. Would be nice if we could write it out to a file.

      1. bayes-cm-10.html
        10 kB
        Lance Norskog
      2. MAHOUT-838.patch
        47 kB
        Lance Norskog
      3. ConfusionMatrix.java
        8 kB
        Lance Norskog
      4. MatrixWritable.java
        5 kB
        Lance Norskog
      5. MAHOUT-838_mini.patch
        41 kB
        Lance Norskog
      6. MAHOUT-838.patch
        53 kB
        Lance Norskog
      7. MAHOUT-838.patch
        42 kB
        Lance Norskog
      8. MAHOUT-838.patch
        49 kB
        Lance Norskog
      9. MAHOUT-838.patch
        41 kB
        Sean Owen
      10. MatrixDumper.java
        5 kB
        Lance Norskog

        Issue Links

          Activity

          Hide
          Lance Norskog added a comment - - edited

          Something like this, for example?

          A tip for those of you who can use tools: find the Easter egg.

          Show
          Lance Norskog added a comment - - edited Something like this , for example? A tip for those of you who can use tools: find the Easter egg.
          Hide
          Lance Norskog added a comment -

          This is I think feature-complete&tested. Needs looking and Apache headers.

          Show
          Lance Norskog added a comment - This is I think feature-complete&tested. Needs looking and Apache headers.
          Hide
          Joe Prasanna Kumar added a comment -

          Lance,
          I just tried applying your patch and ran the TestClassifier as below
          $./bin/mahout testclassifier -m /tmp/mahout-work-joekumar/20news-bydate/bayes-model -d /tmp/mahout-work-joekumar/20news-bydate/bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce -cm 20newsconfusion
          and got the below exception

          Exception in thread "main" java.lang.Error: Unresolved compilation problem:
          The method getMatrix() is undefined for the type ConfusionMatrix at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesClassifierDriver.confusionMatrixSeqFileExport(BayesClassifierDriver.java:144)
          at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesClassifierDriver.runJob(BayesClassifierDriver.java:94)
          at org.apache.mahout.classifier.bayes.TestClassifier.classifyParallel(TestClassifier.java:291)
          at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:194)
          ...
          ...

          I searched in the patch file to see an implementation getMatrix() but couldnt find any. Is there a dependent patch or something ? plz let me know if i'm missing something

          Currently when TestClassifier is run with method option as mapreduce, it doesnt print the Summary. Printing a confusion matrix similar to the screenshot you have attached in this task would be great.

          thanks
          Joe.

          Show
          Joe Prasanna Kumar added a comment - Lance, I just tried applying your patch and ran the TestClassifier as below $./bin/mahout testclassifier -m /tmp/mahout-work-joekumar/20news-bydate/bayes-model -d /tmp/mahout-work-joekumar/20news-bydate/bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce -cm 20newsconfusion and got the below exception Exception in thread "main" java.lang.Error: Unresolved compilation problem: The method getMatrix() is undefined for the type ConfusionMatrix at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesClassifierDriver.confusionMatrixSeqFileExport(BayesClassifierDriver.java:144) at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesClassifierDriver.runJob(BayesClassifierDriver.java:94) at org.apache.mahout.classifier.bayes.TestClassifier.classifyParallel(TestClassifier.java:291) at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:194) ... ... I searched in the patch file to see an implementation getMatrix() but couldnt find any. Is there a dependent patch or something ? plz let me know if i'm missing something Currently when TestClassifier is run with method option as mapreduce, it doesnt print the Summary. Printing a confusion matrix similar to the screenshot you have attached in this task would be great. thanks Joe.
          Hide
          Joe Prasanna Kumar added a comment -

          Patch in Mahout-812 has to be applied before applying this patch

          Show
          Joe Prasanna Kumar added a comment - Patch in Mahout-812 has to be applied before applying this patch
          Hide
          Lance Norskog added a comment -

          MAHOUT-812 was committed on Oct. 3. If your checkout is up-to-date, you should not need to apply it.

          Show
          Lance Norskog added a comment - MAHOUT-812 was committed on Oct. 3. If your checkout is up-to-date, you should not need to apply it.
          Hide
          Lance Norskog added a comment - - edited

          Replace these files with the uploaded versions;
          ConfusionMatrix.java
          MatrixWritable.java

          Then, apply the MAHOUT-838_mini.patch.

          Somehow, my patch generation and applying tools are wildly confused by the previous patch. Also, they do not give the option to merely replace a file. Perhaps it because it is Friday. Perhaps it is because I'm using Cygwin. Perhaps it is merely because I am bad. Whatever the reason, I cannot create that standard patch for this upgrade.

          Show
          Lance Norskog added a comment - - edited Replace these files with the uploaded versions; ConfusionMatrix.java MatrixWritable.java Then, apply the MAHOUT-838 _mini.patch. Somehow, my patch generation and applying tools are wildly confused by the previous patch. Also, they do not give the option to merely replace a file. Perhaps it because it is Friday. Perhaps it is because I'm using Cygwin. Perhaps it is merely because I am bad . Whatever the reason, I cannot create that standard patch for this upgrade.
          Hide
          Sean Owen added a comment -

          Lance I tried just this, but I still get compile errors. Your posted ConfusionMatrix.java and MatrixWritable.java are identical to HEAD, but are missing methods the new code requires.

          Where is cmdump.props used? is it needed?

          It feels like the matrix and confusion matrix dumper have similar roles. I'd imagine they both go in examples, or both in integration? Integration feels a bit righter.

          Finally I'd have a run at scrubbing style – there are some unused imports, unthrown exceptions, constants that aren't static or final, etc.

          Once you sort out your env, make up another omnibus patch here with all relevant changes.

          Show
          Sean Owen added a comment - Lance I tried just this, but I still get compile errors. Your posted ConfusionMatrix.java and MatrixWritable.java are identical to HEAD, but are missing methods the new code requires. Where is cmdump.props used? is it needed? It feels like the matrix and confusion matrix dumper have similar roles. I'd imagine they both go in examples, or both in integration? Integration feels a bit righter. Finally I'd have a run at scrubbing style – there are some unused imports, unthrown exceptions, constants that aren't static or final, etc. Once you sort out your env, make up another omnibus patch here with all relevant changes.
          Hide
          Lance Norskog added a comment -

          Redone. Everything applies, compiles, runs as doc'd.
          Formatted acc'd to the Lucene format prescribed on the wiki.
          All files have Apache header.

          Show
          Lance Norskog added a comment - Redone. Everything applies, compiles, runs as doc'd. Formatted acc'd to the Lucene format prescribed on the wiki. All files have Apache header.
          Hide
          Sean Owen added a comment -

          OK, better. Now we're to small items. Javadoc must start with /** or it isn't javadoc. Always put braces on conditionals. Why are you making all the Maps into LinkedHashMap? I don't see a reason to order by access time. I see stanzas like:

          String cmPath = params.get("confusionMatrix");
          cmPath = "/tmp";

          This can't be right can it – are there still remanants of debugging code? likewise I see lines like "object.hashCode();" which are no-ops.

          You have a comment about using AbstractJob methods – yes, can this be done upfront?

          What's the purpose of PrintConfusionMatrix? it seems like the other two new Dumper classes fulfill the role outlined in this JIRA.

          Show
          Sean Owen added a comment - OK, better. Now we're to small items. Javadoc must start with /** or it isn't javadoc. Always put braces on conditionals. Why are you making all the Maps into LinkedHashMap? I don't see a reason to order by access time. I see stanzas like: String cmPath = params.get("confusionMatrix"); cmPath = "/tmp"; This can't be right can it – are there still remanants of debugging code? likewise I see lines like "object.hashCode();" which are no-ops. You have a comment about using AbstractJob methods – yes, can this be done upfront? What's the purpose of PrintConfusionMatrix? it seems like the other two new Dumper classes fulfill the role outlined in this JIRA.
          Hide
          Joe Prasanna Kumar added a comment -

          Lance,
          In the attached html file, the confusion matrix has a summary with the labels and the Correct instance count with the % When I run testClassifier, I couldnt see this summary in the output. I exec the testclassifier as below ./bin/mahout testclassifier -m /tmp/mahout-work-joekumar/20news-bydate/bayes-model -d /tmp/mahout-work-joekumar/20news-bydate/bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce -cm 20newconfuxion

          The summary is printed in the console when we run the testclassifier in a sequential mode. The toString() of ResultAnalyzer implements the summary info that is being printed in the console. Should we make the summary info as part of the ConfusionMatrix ? Should this be a separate JIRA issue ? In the end, it'd be good to be able to print / write the summary along with the matrix. plz let me know what the best approach would be.

          Joe.

          Show
          Joe Prasanna Kumar added a comment - Lance, In the attached html file, the confusion matrix has a summary with the labels and the Correct instance count with the % When I run testClassifier, I couldnt see this summary in the output. I exec the testclassifier as below ./bin/mahout testclassifier -m /tmp/mahout-work-joekumar/20news-bydate/bayes-model -d /tmp/mahout-work-joekumar/20news-bydate/bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce -cm 20newconfuxion The summary is printed in the console when we run the testclassifier in a sequential mode. The toString() of ResultAnalyzer implements the summary info that is being printed in the console. Should we make the summary info as part of the ConfusionMatrix ? Should this be a separate JIRA issue ? In the end, it'd be good to be able to print / write the summary along with the matrix. plz let me know what the best approach would be. Joe.
          Hide
          Lance Norskog added a comment -

          This is roughly what I uploaded last time and marked 'Patch Available' (I thought). Weird.

          Ok, I'm pretty happy with this one. Print* is folded into ConfusionMatrixDumper. Both Dumpers are AbstractJob.

          One quirk: AbstractJob does not support optional inputPath or outputPath when you use those convenience methods.

          It's a lot more streamlined as to options and features. The options to pick what tables you want are gone: you get all four HTML outputs. The low-level print methods for all are public static. Only MatrixDumper does CSV.

          Show
          Lance Norskog added a comment - This is roughly what I uploaded last time and marked 'Patch Available' (I thought). Weird. Ok, I'm pretty happy with this one. Print* is folded into ConfusionMatrixDumper. Both Dumpers are AbstractJob. One quirk: AbstractJob does not support optional inputPath or outputPath when you use those convenience methods. It's a lot more streamlined as to options and features. The options to pick what tables you want are gone: you get all four HTML outputs. The low-level print methods for all are public static. Only MatrixDumper does CSV.
          Hide
          Lance Norskog added a comment -

          Joe-

          I see your point. I have code to extract the summary in the patch. I don't know where it should go. The toString() method is not the right place for what ConfusionMatrix does now. It should be a separate static method, I think.

          Lance

          Show
          Lance Norskog added a comment - Joe- I see your point. I have code to extract the summary in the patch. I don't know where it should go. The toString() method is not the right place for what ConfusionMatrix does now. It should be a separate static method, I think. Lance
          Hide
          Sean Owen added a comment -

          I'm still seeing minor issues, like funny indentation, making all HashMap into LinkedHashMap (why?) and using "+0.000001" for rounding. Is this going to be suitable for committing or should we just let it hang out in this JIRA?

          Show
          Sean Owen added a comment - I'm still seeing minor issues, like funny indentation, making all HashMap into LinkedHashMap (why?) and using "+0.000001" for rounding. Is this going to be suitable for committing or should we just let it hang out in this JIRA?
          Hide
          Ted Dunning added a comment -

          No. I think both of those issues (LinkedHashMap and bad rounding) are black-ball issues.

          They are going to have to be fixed eventually. Might as well be now.

          For explanation, the principle behind the data type issue is that the declared type of a variable should be the weakest type possible to allow more polymorphism and freedom of implementation in other places. For instance, List arguments should be Iterable if possible instead of List or ArrayList and Map's should be Maps rather than HashMap or LinkedHashMap unless there is a really strong reason otherwise.

          For the rounding, it is simply a matter of correctness. Adding a small number and then truncating is just not right. THis is due to a number of reasons, but the short form reason is that floating point arithmetic is more complicated than it looks. Doing rounding right in all the corner cases is harder than it looks and doing it wrong leads to code that behaves in very surprising and bad ways; these corner cases are they way they are in floating point math for very good reasons.

          Show
          Ted Dunning added a comment - No. I think both of those issues (LinkedHashMap and bad rounding) are black-ball issues. They are going to have to be fixed eventually. Might as well be now. For explanation, the principle behind the data type issue is that the declared type of a variable should be the weakest type possible to allow more polymorphism and freedom of implementation in other places. For instance, List arguments should be Iterable if possible instead of List or ArrayList and Map's should be Maps rather than HashMap or LinkedHashMap unless there is a really strong reason otherwise. For the rounding, it is simply a matter of correctness. Adding a small number and then truncating is just not right. THis is due to a number of reasons, but the short form reason is that floating point arithmetic is more complicated than it looks. Doing rounding right in all the corner cases is harder than it looks and doing it wrong leads to code that behaves in very surprising and bad ways; these corner cases are they way they are in floating point math for very good reasons.
          Hide
          Lance Norskog added a comment -

          Above comments are implemented.

          The ConfusionMatrix default column thing is clumsy. This printing code decides that if there are no unclassified items, then the user did not intend to have unclassified items.

          The MatrixDumper does row & column headers from the matrix now.

          Show
          Lance Norskog added a comment - Above comments are implemented. The ConfusionMatrix default column thing is clumsy. This printing code decides that if there are no unclassified items, then the user did not intend to have unclassified items. The MatrixDumper does row & column headers from the matrix now.
          Hide
          Sean Owen added a comment -

          Lance, there are still unresolved issues in this patch. For example, look at the "/tmp" issue I noted earlier. And now for example you're making a LinkedHashMap into a HashMap when I'm not sure that's correct.

          To close this out, I will try to put this into a committable form on my end, and if it's all pretty minor I'll just commit it to move forward.

          Show
          Sean Owen added a comment - Lance, there are still unresolved issues in this patch. For example, look at the "/tmp" issue I noted earlier. And now for example you're making a LinkedHashMap into a HashMap when I'm not sure that's correct. To close this out, I will try to put this into a committable form on my end, and if it's all pretty minor I'll just commit it to move forward.
          Hide
          Sean Owen added a comment -

          OK, um, somehow when I applied the patch, the IDE showed me the change as of the last time I applied the patch. Another application worked. So some comments were out of date, but not all.

          I think there are a number of small style issues with this patch, but it's close. Here is my proposal.

          Show
          Sean Owen added a comment - OK, um, somehow when I applied the patch, the IDE showed me the change as of the last time I applied the patch. Another application worked. So some comments were out of date, but not all. I think there are a number of small style issues with this patch, but it's close. Here is my proposal.
          Hide
          Sean Owen added a comment -

          This patch still doesn't work for me – tests still fail. If this patch is a bit of a struggle, we can consider leaving it in JIRA but otherwise closing it.

          Show
          Sean Owen added a comment - This patch still doesn't work for me – tests still fail. If this patch is a bit of a struggle, we can consider leaving it in JIRA but otherwise closing it.
          Hide
          Lance Norskog added a comment -

          There's something very wrong with how I'm uploading. What I have is not what you are getting. I'll look into it.

          Show
          Lance Norskog added a comment - There's something very wrong with how I'm uploading. What I have is not what you are getting. I'll look into it.
          Hide
          Lance Norskog added a comment -

          The November 1 patch is correct:

          https://issues.apache.org/jira/secure/attachment/12501764/MAHOUT-838.patch
          md5: 41577c6103f8feaf3b44dee9efe53255

          Apply this to a fresh trunk. All of the tests run, and the two jobs run.

          There is one line that changes a HashMap to a LinkedHashMap, and that should not be there. Formatting problems- some existing code has funny indentation and this patch corrects it. Other than that, this is finished.

          Show
          Lance Norskog added a comment - The November 1 patch is correct: https://issues.apache.org/jira/secure/attachment/12501764/MAHOUT-838.patch md5: 41577c6103f8feaf3b44dee9efe53255 Apply this to a fresh trunk. All of the tests run, and the two jobs run. There is one line that changes a HashMap to a LinkedHashMap, and that should not be there. Formatting problems- some existing code has funny indentation and this patch corrects it. Other than that, this is finished.
          Hide
          Sean Owen added a comment -

          OK that looks pretty good. Committed with some tweaks.

          Show
          Sean Owen added a comment - OK that looks pretty good. Committed with some tweaks.
          Hide
          Lance Norskog added a comment -

          Missing MatrixDumper app.

          Show
          Lance Norskog added a comment - Missing MatrixDumper app.
          Hide
          Lance Norskog added a comment -

          Now we're all set to ditch matrix labels and re-implement it all

          Show
          Lance Norskog added a comment - Now we're all set to ditch matrix labels and re-implement it all
          Hide
          Hudson added a comment -

          Integrated in Mahout-Quality #1151 (See https://builds.apache.org/job/Mahout-Quality/1151/)
          MAHOUT-838 add MatrixDumper

          srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1197822
          Files :

          • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/MatrixDumper.java
          Show
          Hudson added a comment - Integrated in Mahout-Quality #1151 (See https://builds.apache.org/job/Mahout-Quality/1151/ ) MAHOUT-838 add MatrixDumper srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1197822 Files : /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/MatrixDumper.java

            People

            • Assignee:
              Sean Owen
              Reporter:
              Grant Ingersoll
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development