[MAHOUT-1629] Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense when using s3 folder as --input - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 0.9
Fix Version/s: 0.12.0
Component/s: classic
Labels:
- legacy
Environment: AWS EMR with AMI 3.2.3

Description

When running the 'mahout cvb' command on AWS EMR with --input set to an S3 location such as s3://mybucket/input/ or s3://mybucket/input/* (7 input files in my case), the content of the doc-topic output is nonsensical. It looks as if the docIds in the doc-topic output are shuffled. The topic model output (p(term|topic) for each topic) still looks fine.

The workaround is to first copy the input files from S3 to the cluster's HDFS with the command:

hadoop fs -cp s3://mybucket/input /input

and then run mahout cvb with the option --input /input.
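
The two workaround steps can be combined into one small wrapper. This is only a sketch: the function names `run` and `stage_and_run_cvb` and the `DRY_RUN` switch are made up for illustration, the bucket and paths are the placeholders from the description, and any further cvb options are passed through unchanged rather than guessed at. By default it only prints the commands it would execute.

```shell
#!/bin/sh
# Sketch of the workaround: stage the S3 input in HDFS, then point cvb at the HDFS copy.
# S3_INPUT and HDFS_INPUT default to the placeholder values from the description.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute them.
S3_INPUT="${S3_INPUT:-s3://mybucket/input}"
HDFS_INPUT="${HDFS_INPUT:-/input}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: copy the input to HDFS, then run cvb on the copy.
# Extra arguments are forwarded to 'mahout cvb' as-is.
stage_and_run_cvb() {
  run hadoop fs -cp "$S3_INPUT" "$HDFS_INPUT" &&
  run mahout cvb --input "$HDFS_INPUT" "$@"
}

stage_and_run_cvb "$@"
```

Copying first means cvb never reads from the S3 location directly, which is the condition under which the shuffled docIds were observed.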

People

Assignee: Andrew Musselman

Reporter: Markus Paaso

Votes: 0

Watchers: 2

Dates

Created: 24/Nov/14 07:28

Updated: 31/Jan/24 22:14

Resolved: 17/Mar/16 16:02