  1. Mahout
  2. MAHOUT-895

Make Wikipedia example set maker easier to mod

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: Classification, Examples
    • Labels:
      None

      Description

      The WikipediaDatasetCreator uses two mechanisms to extract the text of articles. First, an XmlInputFormat is configured with the "text" tags as start/end markers, so each input record is the article content that those tags demarcate. Then the Mapper pattern-matches the content inside the text tags.
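
      To illustrate the first of these steps, here is a minimal sketch of marker-based record splitting in plain Java. It is not the Mahout XmlInputFormat itself: the class name MarkerSplitter and the sample dump are hypothetical, and the Hadoop plumbing is left out. It only shows how the choice of start/end markers decides what a downstream mapper gets to see.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a marker-driven input format: it cuts the input
// into records that begin at startTag and end at the next endTag.
final class MarkerSplitter {
    private MarkerSplitter() {}

    /** Returns each span from startTag through the next endTag, markers included. */
    static List<String> split(String input, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(startTag, from);
            if (start < 0) {
                break; // no further record starts
            }
            int end = input.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: drop it
            }
            end += endTag.length();
            records.add(input.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String dump =
            "<page><title>A</title><text>Alpha</text></page>"
          + "<page><title>B</title><text>Beta</text></page>";
        // Splitting on the text tags hides the title from the mapper.
        System.out.println(split(dump, "<text>", "</text>"));
        // Splitting on whole pages keeps every field of the article.
        System.out.println(split(dump, "<page>", "</page>"));
    }
}
```

      With the text tags as markers, fields such as the title never reach the mapper, which is why both steps have to be understood before other fields can be added to the dataset.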

      This means a newcomer must discover both pruning steps before modifying the program to create a dataset that includes other fields from the article.

      I am attaching a patch that modifies the Driver to split on entire articles and changes the Mapper to accommodate the extra input without allowing spurious new category matches outside the text element.
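
      The mapper-side idea can be sketched as follows. This is an illustration under assumptions, not the attached patch: the class ArticleCategoryMatcher, its regular expressions, and the lower-casing of category names are all hypothetical. The point it shows is that when the record is a whole article, category matching should first be narrowed to the text element so that other fields cannot produce matches.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: restricts category matching to the <text> element
// of a whole-article record.
final class ArticleCategoryMatcher {
    // DOTALL lets '.' span the newlines inside an article body.
    private static final Pattern TEXT_ELEMENT =
        Pattern.compile("<text[^>]*>(.*?)</text>", Pattern.DOTALL);
    // Captures the category name in links such as [[Category:History|sort key]].
    private static final Pattern CATEGORY_LINK =
        Pattern.compile("\\[\\[Category:([^\\]|]+)");

    private ArticleCategoryMatcher() {}

    /** Returns the body of the first text element, or "" if there is none. */
    static String textOf(String page) {
        Matcher m = TEXT_ELEMENT.matcher(page);
        return m.find() ? m.group(1) : "";
    }

    /** Returns the first wanted (lower-case) category linked inside the text element. */
    static Optional<String> firstWantedCategory(String page, Set<String> wanted) {
        Matcher m = CATEGORY_LINK.matcher(textOf(page));
        while (m.find()) {
            String category = m.group(1).trim().toLowerCase(Locale.ROOT);
            if (wanted.contains(category)) {
                return Optional.of(category);
            }
        }
        return Optional.empty();
    }
}
```

      Other fields, such as the title, remain available on the full record for whatever the modified dataset needs, while a category link that appears only outside the text element yields no match.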

            People

            • Assignee: Sean R. Owen (srowen)
            • Reporter: Tom Pierce (tcp)
