Mahout / MAHOUT-895

Make Wikipedia example set maker easier to mod

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: Classification, Examples
    • Labels:
      None

      Description

      The WikipediaDatasetCreator uses two mechanisms to extract the text of articles. First, an XmlInputFormat is configured with the "text" tags as start/end markers (these demarcate the article content); then the content inside the text tags is pattern-matched out in the Mapper.

      This means a newcomer must discover both pruning steps before modifying the program to create a dataset that includes other fields from the article.

      I am attaching a patch that modifies the Driver to split on entire articles and changes the Mapper to accommodate the extra input without allowing spurious new category matches outside the text element.
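
      To illustrate the idea, here is a minimal standalone sketch, not the attached patch. It assumes the input format hands the mapper one whole `<page>` record, and it restricts category matching to the body of the `<text>` element so that other fields (such as the title) remain available without producing spurious matches. The class name, method names, regular expressions, and sample record are all hypothetical; the property names mentioned in the comment are my recollection of Mahout's XmlInputFormat and should be checked against the source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: processes one whole <page> record, as a mapper might
// receive it if the driver split on "<page>" / "</page>" (for example via
// properties such as "xmlinput.start" and "xmlinput.end"; assumed names).
class ArticleSplitSketch {

  // Body of the <text ...> element; DOTALL lets '.' span line breaks.
  private static final Pattern TEXT_ELEMENT =
      Pattern.compile("<text[^>]*>(.*?)</text>", Pattern.DOTALL);

  // Article title, one of the "other fields" now visible to the mapper.
  private static final Pattern TITLE_ELEMENT =
      Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL);

  // Wiki category links such as [[Category:Name]] or [[Category:Name|key]].
  private static final Pattern CATEGORY_LINK =
      Pattern.compile("\\[\\[Category:([^\\]|]+)");

  // Returns the article body, or an empty string if there is no text element.
  public static String extractText(String page) {
    Matcher m = TEXT_ELEMENT.matcher(page);
    return m.find() ? m.group(1) : "";
  }

  // Returns the article title, or an empty string if there is none.
  public static String extractTitle(String page) {
    Matcher m = TITLE_ELEMENT.matcher(page);
    return m.find() ? m.group(1) : "";
  }

  // Collects category names found inside the text element only, so links
  // elsewhere in the record (e.g. a revision comment) are ignored.
  public static List<String> findCategories(String page) {
    List<String> categories = new ArrayList<>();
    Matcher m = CATEGORY_LINK.matcher(extractText(page));
    while (m.find()) {
      categories.add(m.group(1).trim());
    }
    return categories;
  }

  public static void main(String[] args) {
    // Made-up sample record; the comment element holds a link that must
    // not count as a category match.
    String page = "<page><title>Apache Mahout</title>"
        + "<revision><comment>added [[Category:Spurious]]</comment>"
        + "<text xml:space=\"preserve\">Mahout is a library.\n"
        + "[[Category:Machine learning]]</text></revision></page>";
    System.out.println(extractTitle(page) + " -> " + findCategories(page));
  }
}
```

      With the sample record above, only the category inside the text element is reported, while the title is still available for anyone who wants to add it to the dataset.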

        Activity

        tom pierce created issue -
        tom pierce made changes -
        Attachment: added MAHOUT-895.patch [ 12504965 ]
        tom pierce made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Lance Norskog added a comment -

        +1. Anything to make the examples more clear.

        Sean Owen made changes -
        Assignee: set to Sean Owen [ srowen ]
        Fix Version/s: set to 0.6 [ 12316364 ]
        Resolution: set to Fixed [ 1 ]
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Hudson added a comment -

        Integrated in Mahout-Quality #1206 (See https://builds.apache.org/job/Mahout-Quality/1206/)
        MAHOUT-895 Match Wikipedia start/close tags as-is without preprocessing

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1207060
        Files :

        • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorDriver.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java
        Sean Owen made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]

          People

          • Assignee: Sean Owen
          • Reporter: tom pierce
          • Votes: 0
          • Watchers: 1
