SPARK-2442: Add a Hadoop Writable serializer


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      Using data read from Hadoop files in shuffles can cause exceptions with the following stack trace:

      java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
      	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181)
      	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
      	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
      	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
      	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
      	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
      	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
      	at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179)
      	at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
      	at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
      	at org.apache.spark.scheduler.Task.run(Task.scala:51)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:679)
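
      For reference, a minimal sketch that should reproduce this under the default Java serializer (the input path is hypothetical; any SequenceFile of Text/BytesWritable pairs would do):

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._
      import org.apache.hadoop.io.{BytesWritable, Text}

      object WritableShuffleRepro {
        def main(args: Array[String]): Unit = {
          // No Kryo configuration here, so the default JavaSerializer is in effect.
          val sc = new SparkContext(new SparkConf().setAppName("writable-shuffle-repro"))

          // Hypothetical path to a SequenceFile[Text, BytesWritable].
          val rdd = sc.sequenceFile("/tmp/input", classOf[Text], classOf[BytesWritable])

          // groupByKey shuffles the Writable keys and values; Java serialization
          // then fails with the NotSerializableException shown above.
          rdd.groupByKey().count()

          sc.stop()
        }
      }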
      

      This seems to go away if the Kryo serializer is used. I am wondering if adding a Hadoop Writable-friendly serializer makes sense, as it is likely to perform better than Kryo without registration: since Writables don't implement Serializable, generic serialization of them might not be the most efficient.
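
      Such a serializer could be a thin Kryo adapter that delegates to each Writable's own write()/readFields() methods, avoiding reflection-based field serialization entirely. A minimal sketch, assuming the Kryo 2.x API shipped with Spark (the WritableSerializer and WritableRegistrator class names are hypothetical, not existing Spark APIs):

      import java.io.{DataInputStream, DataOutputStream}

      import com.esotericsoftware.kryo.{Kryo, Serializer}
      import com.esotericsoftware.kryo.io.{Input, Output}
      import org.apache.hadoop.io.{BytesWritable, Text, Writable}
      import org.apache.spark.serializer.KryoRegistrator

      // Serializes any Writable through its own write()/readFields() methods.
      class WritableSerializer[T <: Writable] extends Serializer[T] {
        override def write(kryo: Kryo, output: Output, t: T): Unit =
          t.write(new DataOutputStream(output))

        override def read(kryo: Kryo, input: Input, clazz: Class[T]): T = {
          val w = clazz.newInstance() // Writables must have a no-arg constructor
          w.readFields(new DataInputStream(input))
          w
        }
      }

      // Registers the adapter for the Writable types used by the job.
      class WritableRegistrator extends KryoRegistrator {
        override def registerClasses(kryo: Kryo): Unit = {
          kryo.register(classOf[Text], new WritableSerializer[Text])
          kryo.register(classOf[BytesWritable], new WritableSerializer[BytesWritable])
        }
      }

      This would be enabled by setting spark.serializer to org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator to the registrator class above.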

    Attachments

    Issue Links

    Activity


    People

      Assignee: Unassigned
      Reporter: Hari Shreedharan (hshreedharan)
      Votes: 0
      Watchers: 2

    Dates

      Created:
      Updated:
      Resolved:
