Mahout / MAHOUT-1615

SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Version: 0.10.0

    Description

      When reading seq2sparse output of the form <Text, VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method creates RDDs in which every pair has the same key:

      mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
      

      The resulting DRM has the keys:

      {...}
      key: /talk.religion.misc/84570
      key: /talk.religion.misc/84570
      key: /talk.religion.misc/84570
      {...}

      for the entire set. This is the last Key in the set.

      The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:

       val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
              // Get rid of VectorWritable
              .map(t => (t._1, t._2.get()))
      

      which yields the same key for every t._1.
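The symptom (every pair carrying the last key in the file) is consistent with Hadoop's record reader reusing a single Writable instance for each record, so that references kept without copying all end up pointing at the same object. The following is a minimal sketch of that mechanism in plain Scala, with no Spark or Hadoop dependency. MutableKey, records, KeyReuseSketch and the example keys are hypothetical stand-ins invented for illustration; they are not part of Mahout, and this is not necessarily the fix that was committed for this issue.

```scala
// Hypothetical stand-in for a mutable Hadoop Writable such as Text:
// a single object whose contents are overwritten for every record.
final class MutableKey(var value: String) {
  override def toString: String = value
}

object KeyReuseSketch {
  // Illustrative keys only; not taken from the 20news data set.
  val input: Seq[String] = Seq("/example/1", "/example/2", "/example/3")

  // Simulated record reader: returns the SAME key instance for every
  // record and only mutates its contents between records.
  def records(keys: Seq[String]): Iterator[(MutableKey, Int)] = {
    val shared = new MutableKey("")
    keys.iterator.zipWithIndex.map { case (k, i) =>
      shared.value = k
      (shared, i)
    }
  }

  // Keeps the key reference, as in the map quoted above: once the
  // records are materialized, every pair points at the shared object.
  def buggy: Seq[String] =
    records(input).map(t => (t._1, t._2)).toVector.map(_._1.toString)

  // Copies each key into an immutable String before materializing.
  def fixed: Seq[String] =
    records(input).map(t => (t._1.toString, t._2)).toVector.map(_._1)

  def main(args: Array[String]): Unit = {
    println(buggy) // all entries equal the last key
    println(fixed) // one distinct key per record
  }
}
```

Under the same assumption, the analogous change for Text keys in drmFromHDFS would be to map t._1 to an immutable copy (for example t._1.toString) before the pairs are cached or collected.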

          People

            Andrew_Palumbo Andrew Palumbo