Spark / SPARK-2442

Add a Hadoop Writable serializer


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      Using data read from hadoop files in shuffles can cause exceptions with the following stacktrace:

      java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
      	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181)
      	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
      	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
      	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
      	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
      	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
      	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
      	at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179)
      	at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
      	at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
      	at org.apache.spark.scheduler.Task.run(Task.scala:51)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:679)
      

      This seems to go away, though, if the Kryo serializer is used. I am wondering if adding a Hadoop-Writable-friendly serializer makes sense, since it is likely to perform better than Kryo without registration; Writables don't implement Serializable, so serializing them generically might not be the most efficient.
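      Until such a serializer exists, the usual workarounds are to register the Writable classes with Kryo, or to copy the Writables into plain JVM types before any shuffle. A minimal sketch of both (BytesWritable comes from the stack trace above; the app name, input path, and the LongWritable/Text registrations are illustrative assumptions):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.hadoop.io.{BytesWritable, LongWritable, Text}

      // Workaround 1: switch the serializer to Kryo and register the
      // Writable classes so Kryo can serialize them efficiently.
      // (registerKryoClasses assumes Spark 1.2+; on earlier versions a
      // custom KryoRegistrator serves the same purpose.)
      val conf = new SparkConf()
        .setAppName("writable-shuffle") // hypothetical app name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(Array(
          classOf[BytesWritable],
          classOf[LongWritable],
          classOf[Text]))

      val sc = new SparkContext(conf)

      // Workaround 2: copy the Writables into serializable JVM types
      // immediately after reading. This also avoids a separate pitfall:
      // the record reader reuses Writable instances across records.
      val pairs = sc.sequenceFile[LongWritable, BytesWritable]("/path/to/input")
        .map { case (k, v) => (k.get, v.copyBytes) } // (Long, Array[Byte])
      ```

      The second workaround sidesteps serialization of Writables entirely, at the cost of an extra copy per record.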

            People

              Assignee: Unassigned
              Reporter: Hari Shreedharan (hshreedharan)
