Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-9523

Receiver for Spark Streaming does not naturally support kryo serializer

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Not A Problem
    • 1.3.1
    • None
    • DStreams
    • Windows 7 local mode

    • Patch

    Description

      In some cases, some attributes in a class is not serializable, which you still want to use after serialization of the whole object, you'll have to customize your serialization codes. For example, you can declare those attributes as transient, which makes them ignored during serialization, and then you can reassign their values during deserialization.

      Now, if you're using Java serialization, you'll have to implement Serializable, and write those codes in readObject() and writeObejct() methods; And if you're using kryo serialization, you'll have to implement KryoSerializable, and write these codes in read() and write() methods.

      In Spark and Spark Streaming, you can set kryo as the serializer for speeding up. However, the functions taken by RDD or DStream operations are still serialized by Java serialization, which means you only need to write those custom serialization codes in readObject() and writeObejct() methods.

      But when it comes to Spark Streaming's Receiver, things are different. When you wish to customize an InputDStream, you must extend the Receiver. However, it turns out, the Receiver will be serialized by kryo if you set kryo serializer in SparkConf, and will fall back to Java serialization if you didn't.

      So here's comes the problems, if you want to change the serializer by configuration and make sure the Receiver runs perfectly for both Java and kryo, you'll have to write all the 4 methods above. First, it is redundant, since you'll have to write serialization/deserialization code almost twice; Secondly, there's nothing in the doc or in the code to inform users to implement the KryoSerializable interface.

      Since all other function parameters are serialized by Java only, I suggest you also make it so for the Receiver. It may be slower, but since the serialization will only be executed for each interval, it's durable. More importantly, it can cause fewer trouble

      Attachments

        Activity

          People

            Unassigned Unassigned
            fish748 Yuhang Chen
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 120h
                120h
                Remaining:
                Remaining Estimate - 120h
                120h
                Logged:
                Time Spent - Not Specified
                Not Specified