Spark / SPARK-1018

take and collect don't work on HadoopRDD


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.8.1
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      I am reading a simple text file using hadoopFile as follows:
      import org.apache.hadoop.mapred.TextInputFormat
      import org.apache.hadoop.io.{LongWritable, Text}
      var hrdd1 = sc.hadoopFile("/home/training/testdata.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

      Testing using this simple text file:
      001 this is line 1
      002 this is line two
      003 yet another line

      The data is read correctly, as I can verify using println:
      scala> hrdd1.foreach(println)
      (0,001 this is line 1)
      (19,002 this is line two)
      (40,003 yet another line)

      But neither collect nor take works properly. take repeatedly returns the key (byte offset) of a non-existent last line:
      scala> hrdd1.take(4)
      res146: Array[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)] = Array((61,), (61,), (61,))
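
      This looks like Hadoop's usual record-object reuse: the RecordReader hands back the same LongWritable and Text instances for every record, so any action that buffers the pairs ends up with many references to one object in its final state (offset 61, the end of the file). foreach works because each pair is printed before the reader overwrites it. A sketch of the suspected aliasing, assuming that reuse behavior:

      val buffered = hrdd1.take(3)
      buffered(0)._1 eq buffered(1)._1  // expected true: every pair shares one reused LongWritable
      buffered(0)._2 eq buffered(1)._2  // expected true: likewise for the Text value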

      collect is even worse: it throws, because Hadoop Writables do not implement java.io.Serializable:
      java.io.NotSerializableException: org.apache.hadoop.io.LongWritable
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)

      The problem appears to be the Writable types in both cases, because if I map to a new RDD, converting the keys and values to strings, it works:
      scala> hrdd1.map(pair => (pair._1.toString, pair._2.toString)).take(4)
      res148: Array[(java.lang.String, java.lang.String)] = Array((0,001 this is line 1), (19,002 this is line two), (40,003 yet another line))
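
      Cloning the records instead of converting them separates the two symptoms: with something like Hadoop's WritableUtils.clone, take should see distinct offsets (at least when it runs locally in the driver), while collect still fails because the clones are still not serializable. A sketch assuming that helper:

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.io.WritableUtils

      val cloned = hrdd1.mapPartitions { iter =>
        val conf = new Configuration()  // created per partition; Configuration itself is not serializable
        iter.map(pair => (WritableUtils.clone(pair._1, conf), WritableUtils.clone(pair._2, conf)))
      }
      cloned.take(4)   // distinct offsets now, not N copies of the last one
      cloned.collect() // still throws NotSerializableException: the clones are still Writables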

      It seems to me that either rdd.collect and rdd.take ought to handle non-serializable types gracefully, or hadoopFile should return a mapped RDD that converts the Hadoop types into the appropriate serializable Java objects, along the lines of the sketch below. (Or, at the very least, the API docs should indicate that the usual RDD methods don't work on HadoopRDDs.)
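
      A hypothetical convenience wrapper (the name and shape are mine, not part of the Spark API) could look like:

      import org.apache.hadoop.io.{LongWritable, Text}
      import org.apache.hadoop.mapred.TextInputFormat
      import org.apache.spark.SparkContext
      import org.apache.spark.rdd.RDD

      // Callers get plain, serializable (Long, String) pairs and never
      // touch the reused Writable instances.
      def textFileWithOffsets(sc: SparkContext, path: String): RDD[(Long, String)] =
        sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
          .map(pair => (pair._1.get, pair._2.toString))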

      BTW, this behavior is the same for both the old- and new-API versions of hadoopFile, and the same whether the file is read from HDFS or from the local filesystem.
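
      For reference, a sketch of the new-API equivalent of the same read, which shows the same symptoms until the pairs are converted:

      import org.apache.hadoop.io.{LongWritable, Text}
      import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}

      val hrdd2 = sc.newAPIHadoopFile("/home/training/testdata.txt",
        classOf[NewTextInputFormat], classOf[LongWritable], classOf[Text])
      hrdd2.map(pair => (pair._1.get, pair._2.toString)).take(4)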


          People

            Assignee: Unassigned
            Reporter: Diana Carroll (dcarroll@cloudera.com)
            Votes: 2
            Watchers: 9

            Dates

              Created:
              Updated:
              Resolved:
