Mahout / MAHOUT-1615

SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None

      Description

      When reading seq2sparse output of the form <Text, VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method creates RDDs with the same key for all pairs:

      mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
      

      The resulting DRM has keys:

      {...}
      key: /talk.religion.misc/84570
      key: /talk.religion.misc/84570
      key: /talk.religion.misc/84570
      {...}

      for the entire set. The repeated key is the last key in the set.

      The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:

       val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
              // Get rid of VectorWritable
              .map(t => (t._1, t._2.get()))
      

      which yields the same key in t._1 for every record.
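
      The report does not name a root cause. One plausible explanation, offered here as an assumption, is that Hadoop record readers reuse a single Writable object per record, so references kept from the map all end up pointing at the last key read. The sketch below reproduces that aliasing effect without Spark or Hadoop; ReusedKey, WritableReuseDemo, and the doc-a and doc-b keys are hypothetical names made up for illustration and are not part of Mahout.

```scala
// Hypothetical stand-in for a Hadoop Writable key such as Text:
// one mutable object whose contents are overwritten for each record.
final class ReusedKey(var value: String) {
  override def toString: String = value
}

object WritableReuseDemo {
  // Mimics a record reader that hands back the SAME key object for every record.
  def read(records: Seq[String]): Iterator[ReusedKey] = {
    val shared = new ReusedKey("")
    records.iterator.map { r =>
      shared.value = r
      shared
    }
  }

  // Keeps references to the reused object and only reads them afterwards,
  // so every entry reflects the last record that was read.
  def aliasedKeys(records: Seq[String]): List[String] =
    read(records).toList.map(_.toString)

  // Copies each key into an immutable String while its record is current.
  def copiedKeys(records: Seq[String]): List[String] =
    read(records).map(_.toString).toList

  def main(args: Array[String]): Unit = {
    // Only the last key is taken from the report; the others are made up.
    val records = Seq("doc-a", "doc-b", "/talk.religion.misc/84570")
    println(aliasedKeys(records)) // every entry is the last key
    println(copiedKeys(records))  // the distinct keys, in order
  }
}
```

      If this is indeed the cause, the analogous change in the pipeline quoted above would be to copy the key inside the map, for example by converting t._1 with toString for Text keys. That is an assumption for illustration, not necessarily the fix that was committed for this issue.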

            People

            • Assignee: Andrew Palumbo
            • Reporter: Andrew Palumbo