Mahout / MAHOUT-286

Need to be able to run classifiers from non-text input (such as ARFF data)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 0.3
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels:
      None

      Description

      Martin Haeger wrote this:

      We're experimenting a bit with Weka and Mahout. Our input data is a
      relation in ARFF format (see attached data.training.arff), and we'd
      like to classify it using Mahout. However, it seems (to us, at first)
      that the Mahout classifier.bayes.interfaces.Algorithm interface is
      centered around documents of text, and not general attribute data.
      Thus, running the classifier causes our ARFF data to be interpreted as
      a document of words, with not very useful results (see attached
      mahout.log).

      With Weka, we're able to get the results we want (see attached weka.log).

      Any suggestions for how to get this working?
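To illustrate the mismatch this report describes, here is a minimal sketch (not part of the original message) that compares reading an ARFF relation as typed attributes with splitting it into words. It handles only a small subset of ARFF: `@attribute` declarations and unquoted, comma-separated `@data` rows. The function `parse_arff` and the sample relation are hypothetical and are not taken from the attached files or from Mahout.

```python
def parse_arff(text):
    """Parse a small ARFF subset into (attributes, rows).

    attributes: list of (name, kind), where kind is "numeric", another
    type name such as "string", or a list of nominal values.
    rows: list of dicts mapping attribute name to a float or a string.
    """
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":  # ARFF marker for a missing value
                    row[name] = None
                elif kind == "numeric":
                    row[name] = float(value)
                else:
                    row[name] = value
            rows.append(row)
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            elif kind.lower() in ("numeric", "real", "integer"):
                kind = "numeric"
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


# Hypothetical sample relation, used only for illustration.
SAMPLE = """\
% hypothetical relation, not the attached data.training.arff
@relation example
@attribute temperature numeric
@attribute humidity numeric
@attribute outlook {sunny,overcast,rainy}
@attribute play {yes,no}
@data
85,85,sunny,no
64,65,overcast,yes
"""

attributes, rows = parse_arff(SAMPLE)

# Reading the same file as a document of words keeps each data row as one
# opaque token and turns the header keywords into "words" as well.
words = SAMPLE.split()
```

Parsed this way, each row keeps its attribute names and numeric values, whereas the word-based reading yields tokens such as `85,85,sunny,no` that carry no attribute structure.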

      Attachments

      1. data.arff (7 kB, Martin Häger)
      2. data.training.arff (8 kB, Martin Häger)
      3. mahout.log (25 kB, Ted Dunning)
      4. run.sh (1.0 kB, Martin Häger)
      5. weka.log (2 kB, Ted Dunning)

          Activity

          srowen Sean Owen added a comment -

          As part of a bit of house-cleaning I'm at least attaching this to MAHOUT-155, which concerns ARFF.

          tdunning Ted Dunning added a comment -

          Aside from the ARFF part, this has been happening off topic on the SGD side of the world.

          srowen Sean Owen added a comment -

          I'll mark this for 0.5, but I'm not seeing any movement on this, so it may just get closed out.

          robinanil Robin Anil added a comment -

          I will have to move this to 0.4. The Bayes classifier only supports binary features (word exists or not). It definitely needs to be able to support numeric features. That will happen only after converting the classifier to SparseVector format. I could give a patch which extracts only the binary features for this release.
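As a rough illustration of what extracting only binary features could look like, the sketch below turns one parsed ARFF row into presence tokens of the form `name=value` and drops numeric attributes. This is one assumed reading of the comment above, not the patch it mentions; the function `binary_tokens` and the sample row are hypothetical.

```python
def binary_tokens(row, label_attribute):
    """Return (label, tokens) for one row given as a dict.

    Each non-numeric attribute becomes a 'name=value' token, so a
    classifier that only tracks whether a word exists can treat it as a
    binary feature. Numeric attributes are skipped because a presence
    flag cannot carry their magnitude (an assumption of this sketch).
    """
    tokens = []
    for name, value in row.items():
        if name == label_attribute:
            continue  # the class label is not a feature
        if value is None or isinstance(value, (int, float)):
            continue  # drop missing and numeric values
        tokens.append(f"{name}={value}")
    return row[label_attribute], tokens


# Hypothetical row, used only for illustration.
row = {
    "temperature": 85.0,
    "humidity": 85.0,
    "outlook": "sunny",
    "windy": "false",
    "play": "no",
}
label, tokens = binary_tokens(row, "play")
```

The resulting tokens could be fed to a word-based classifier as a short "document", at the cost of discarding `temperature` and `humidity`, which is the limitation that support for numeric features would remove.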

          mtah Martin Häger added a comment -

          Attaching:

          • data.arff - test data in ARFF format
          • data.training.arff - training data in ARFF format
          • run.sh - a script that shows how Mahout was run
          tdunning Ted Dunning added a comment -

          Here are the original attachments Martin sent.


            People

            • Assignee:
              Unassigned
              Reporter:
              tdunning Ted Dunning