  1. Mahout
  2. MAHOUT-1629

Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense when using s3 folder as --input


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.9
    • Fix Version/s: 0.12.0
    • Component/s: classic
    • Environment: AWS EMR with AMI 3.2.3

    Description

      When running the 'mahout cvb' command on AWS EMR with --input set to an S3 location such as s3://mybucket/input/ or s3://mybucket/input/* (7 input files in my case), the doc-topic output is nonsensical: the docIds appear to be shuffled relative to their topic distributions. The topic model output (p(term|topic) for each topic) still looks fine.

      The workaround is to first copy the input files from S3 to the cluster's HDFS with the command:

      hadoop fs -cp s3://mybucket/input /input

      and then run mahout cvb with the option --input /input.
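
      The two workaround steps could be combined into one script, roughly as sketched below. This is only an illustration: the run helper, the DRY_RUN switch, and the CVB_OPTS variable are hypothetical names introduced here, and CVB_OPTS stands in for whatever other cvb options the job already uses. By default the script only prints the commands; set DRY_RUN=0 on the cluster to execute them.

      ```shell
      #!/bin/sh
      # Sketch of the workaround: stage the S3 input in HDFS, then point cvb at the copy.
      # The bucket and paths are the example values from the description above.
      S3_INPUT="s3://mybucket/input"
      HDFS_INPUT="/input"

      # Hypothetical placeholder for the remaining cvb options already in use
      # (output path, number of topics, and so on). Left empty here on purpose.
      CVB_OPTS="${CVB_OPTS:-}"

      # Hypothetical helper: print the command when DRY_RUN=1 (the default),
      # otherwise execute it.
      run() {
        if [ "${DRY_RUN:-1}" = "1" ]; then
          echo "$@"
        else
          "$@"
        fi
      }

      # Step 1: copy the input files from S3 into the cluster's HDFS.
      run hadoop fs -cp "$S3_INPUT" "$HDFS_INPUT"

      # Step 2: run cvb against the HDFS copy instead of the S3 folder.
      # CVB_OPTS is intentionally unquoted so that it splits into separate arguments.
      run mahout cvb --input "$HDFS_INPUT" $CVB_OPTS
      ```

      Keeping the S3 and HDFS paths in variables makes it easy to check that the path passed to --input is the HDFS copy and not the original S3 folder.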

          People

            Assignee: Andrew Musselman (andrew.musselman)
            Reporter: Markus Paaso (markus.paaso)
