[MAHOUT-1615] SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.10.0
Component/s: None
Labels: None

Description

When reading seq2sparse output of the form <Text,VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method creates RDDs with the same key for all pairs:

mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")

The resulting DRM has keys:

{...}
key: /talk.religion.misc/84570
key: /talk.religion.misc/84570
key: /talk.religion.misc/84570
{...}

for the entire set. The repeated key is the last key in the set.

The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:

val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
    // Get rid of VectorWritable
    .map(t => (t._1, t._2.get()))

which gives the same key for all t._1.
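
A likely explanation, offered here as an assumption rather than a confirmed diagnosis: Hadoop record readers reuse a single Writable instance for every record, so keeping references to t._1 without copying leaves every pair pointing at one object that holds whatever was read last. The sketch below reproduces that effect in plain Scala with no Spark or Hadoop dependency. MutableKey, reusingReader, and the sample keys are hypothetical stand-ins invented for illustration; they are not part of Mahout, Spark, or Hadoop, and this is not necessarily the fix that was committed for this issue.

```scala
// Hypothetical stand-in for a mutable Hadoop key such as Text (not a real Hadoop class).
final class MutableKey(var value: String) {
  override def toString: String = value
}

object KeyReuseDemo {
  // Made-up sample keys, for illustration only.
  val input: Seq[String] = Seq("key-a", "key-b", "key-c")

  // Mimics a record reader that reuses ONE key instance for every record,
  // overwriting its contents in place each time the next record is read.
  def reusingReader(keys: Seq[String]): Iterator[(MutableKey, Int)] = {
    val shared = new MutableKey("")
    keys.iterator.zipWithIndex.map { case (k, i) =>
      shared.value = k
      (shared, i)
    }
  }

  // Problematic: materializes references to the shared key, so once reading
  // is finished every row shows the last key that was read.
  def aliased: Seq[String] =
    reusingReader(input).toList.map(_._1.toString)

  // Workaround: copy each key into an immutable String while the record
  // is current, before anything is materialized.
  def copied: Seq[String] =
    reusingReader(input).map { case (k, v) => (k.toString, v) }.toList.map(_._1)

  def main(args: Array[String]): Unit = {
    println(aliased.mkString(", ")) // prints "key-c, key-c, key-c"
    println(copied.mkString(", "))  // prints "key-a, key-b, key-c"
  }
}
```

If the same reasoning applies to drmFromHDFS, converting the key to an immutable value inside the first map (for example t._1.toString for Text keys) before the RDD is cached or collected would be one way to avoid the aliasing.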

People

Assignee: Andrew Palumbo

Reporter: Andrew Palumbo

Votes: 0

Watchers: 3

Dates

Created: 13/Sep/14 22:58

Updated: 13/Apr/15 10:20

Resolved: 10/Oct/14 21:41