Mahout / MAHOUT-1629

Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense when using s3 folder as --input

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.9
    • Fix Version/s: 0.12.0
    • Component/s: Clustering
    • Labels:
    • Environment:

      AWS EMR with AMI 3.2.3

      Description

      When running the 'mahout cvb' command on AWS EMR with --input set to an S3 location such as s3://mybucket/input/ or s3://mybucket/input/* (7 input files in my case), the content of the doc-topic output is nonsensical: the docIds in the doc-topic output appear to be shuffled. The topic model output (p(term|topic) for each topic) still looks fine.

      The workaround is to first copy the input files from S3 to the cluster's HDFS with the command:

      hadoop fs -cp s3://mybucket/input /input

      and then run mahout cvb with the option --input /input.
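
      The two workaround steps can be combined into one script. This is a minimal sketch, not a tested EMR job: the /output directory and the DRY_RUN switch are assumptions introduced here for illustration, and only --input and --output are passed to cvb. Any other options from the original cvb invocation would need to be appended as extra arguments.

```shell
#!/bin/sh
# Sketch of the workaround: stage the S3 input in HDFS, then run cvb on the HDFS copy.
# Bucket and input paths come from the report; HDFS_OUTPUT is a hypothetical placeholder.
S3_INPUT="${S3_INPUT:-s3://mybucket/input}"
HDFS_INPUT="${HDFS_INPUT:-/input}"
HDFS_OUTPUT="${HDFS_OUTPUT:-/output}"

# DRY_RUN is a convention of this sketch only: by default (DRY_RUN=1) commands
# are printed instead of executed; set DRY_RUN=0 on a real cluster.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

stage_and_run_cvb() {
    # Step 1: copy the input files from S3 to the cluster's HDFS.
    run hadoop fs -cp "$S3_INPUT" "$HDFS_INPUT" || return 1
    # Step 2: run cvb against the HDFS copy; extra arguments are passed through.
    run mahout cvb --input "$HDFS_INPUT" --output "$HDFS_OUTPUT" "$@"
}

stage_and_run_cvb "$@"
```

      With the default DRY_RUN=1 the script only prints the two commands, which makes it easy to check the paths before running it with DRY_RUN=0 on the cluster.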

            People

            • Assignee: Andrew Musselman (andrew.musselman)
            • Reporter: Markus Paaso (markus.paaso)