BEAM-2669: Kryo serialization exception when DStreams containing non-Kryo-serializable data are cached

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Versions: 0.4.0, 0.5.0, 0.6.0, 2.0.0
    • Fix Version: 2.2.0
    • Component: runner-spark
    • Labels: None

    Description

      Today, when we detect re-use of a dataset in a pipeline in the Spark runner, we eagerly cache it to avoid computing the same data multiple times.
      (EvaluationContext.java)
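
      For illustration, a minimal sketch of that decision, assuming a hypothetical usage counter rather than the actual EvaluationContext bookkeeping: once a second consumer of the same dataset registers, the dataset is cached eagerly.

        import java.util.HashMap;
        import java.util.Map;

        public class CacheOnReuseSketch {
          // Hypothetical stand-in for a dataset handle in the Spark runner.
          interface Dataset {
            void cache();
          }

          private final Map<Dataset, Integer> consumers = new HashMap<>();

          // Invoked once per downstream transform that reads the dataset.
          public void registerConsumer(Dataset dataset) {
            int count = consumers.merge(dataset, 1, Integer::sum);
            if (count == 2) {
              // Second consumer detected: the dataset is re-used, so cache
              // it now instead of recomputing it for every consumer.
              dataset.cache();
            }
          }
        }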

      When the dataset is bounded, which in Spark is represented by an RDD, we call RDD#persist and use the storage level provided by the user via SparkPipelineOptions. (BoundedDataset.java)
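
      In sketch form, the bounded path looks roughly like this, assuming the storage level arrives as the string exposed by SparkPipelineOptions (e.g. "MEMORY_ONLY"); the helper name is illustrative, not BoundedDataset's actual code.

        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.storage.StorageLevel;

        public final class BoundedCacheSketch {
          // Persist a bounded dataset's RDD at the user-chosen storage level,
          // e.g. "MEMORY_ONLY" or "MEMORY_AND_DISK_SER".
          public static <T> JavaRDD<T> persistBounded(JavaRDD<T> rdd, String level) {
            return rdd.persist(StorageLevel.fromString(level));
          }
        }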

      When the dataset is unbounded, which in Spark is represented by a DStream, we call DStream.cache(), which defaults to persisting the DStream at storage level MEMORY_ONLY_SER. (UnboundedDataset.java)
      (DStream.scala)
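
      Again in sketch form (the helper is illustrative), the unbounded path never consults the user's storage level, because DStream.cache() is shorthand for persisting at MEMORY_ONLY_SER.

        import org.apache.spark.streaming.api.java.JavaDStream;

        public final class UnboundedCacheSketch {
          public static <T> void cacheUnbounded(JavaDStream<T> stream) {
            // cache() fixes the storage level at MEMORY_ONLY_SER; it is
            // equivalent to:
            //   stream.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER());
            stream.cache();
          }
        }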

      Storage level MEMORY_ONLY_SER means Spark will serialize the data using its configured serializer. Since we hard-code this serializer to be Kryo, the data will be serialized using Kryo. (SparkContextFactory.java)
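
      Schematically, the hard-coding looks like the following; this is a sketch in the spirit of SparkContextFactory, not its exact code.

        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaSparkContext;

        public final class ContextFactorySketch {
          public static JavaSparkContext create(String master, String appName) {
            SparkConf conf = new SparkConf()
                .setMaster(master)
                .setAppName(appName)
                // The serializer is fixed to Kryo regardless of what the user
                // configured for their Spark installation.
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
            return new JavaSparkContext(conf);
          }
        }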

      Due to this, if your DStream contains non-Kryo-serializable data, you will encounter Kryo serialization exceptions and your task will fail.
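
      As a plain-Kryo illustration (outside Spark, and only one of several possible failure modes), the snippet below enables mandatory registration and then writes an unregistered class, which fails with IllegalArgumentException; inside Spark, analogous serialization failures surface when the cached DStream's elements are written, failing the task.

        import com.esotericsoftware.kryo.Kryo;
        import com.esotericsoftware.kryo.io.Output;
        import java.io.ByteArrayOutputStream;

        public final class KryoFailureSketch {
          // A class Kryo will refuse to serialize once registration is required.
          static final class Unregistered {
            final int value = 42;
          }

          public static void main(String[] args) {
            Kryo kryo = new Kryo();
            kryo.setRegistrationRequired(true);
            try (Output out = new Output(new ByteArrayOutputStream())) {
              // Throws IllegalArgumentException: "Class is not registered".
              kryo.writeClassAndObject(out, new Unregistered());
            } catch (IllegalArgumentException e) {
              System.err.println("Kryo serialization failed: " + e.getMessage());
            }
          }
        }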


            People

              Assignee: Kobi Salant (ksalant)
              Reporter: Aviem Zur (aviemzur)
              Votes: 0
              Watchers: 6
