Details
- Type: Bug
- Status: Resolved
- Priority: P2
- Resolution: Fixed
- Affects Versions: 0.4.0, 0.5.0, 0.6.0, 2.0.0
- Fix Versions: None
Description
Today, when the Spark runner detects re-use of a dataset in a pipeline, it eagerly caches the dataset to avoid computing the same data multiple times. (EvaluationContext.java)
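Conceptually, the re-use detection amounts to counting consumers per dataset. The following is a purely illustrative sketch; CacheOnReuseTracker and its methods are hypothetical names, not the actual EvaluationContext API:
{code:java}
import java.util.HashMap;
import java.util.Map;

class CacheOnReuseTracker {
  // Number of downstream transforms consuming each dataset, keyed by a
  // dataset identifier (hypothetical bookkeeping).
  private final Map<String, Integer> consumers = new HashMap<>();

  void registerConsumer(String datasetId) {
    consumers.merge(datasetId, 1, Integer::sum);
  }

  // A dataset read by more than one transform is worth persisting so it
  // is not recomputed for each consumer.
  boolean shouldCache(String datasetId) {
    return consumers.getOrDefault(datasetId, 0) > 1;
  }
}
{code}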
When the dataset is bounded, which Spark represents as an RDD, we call RDD#persist with the storage level the user provides via SparkPipelineOptions. (BoundedDataset.java)
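In code, the bounded path amounts to roughly the following. This is a minimal sketch, assuming the storage level is carried as a String (as with SparkPipelineOptions#getStorageLevel) and parsed by Spark:
{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

class BoundedCachingSketch {
  // Persist the RDD at whatever level the user configured, e.g.
  // "MEMORY_ONLY", "MEMORY_AND_DISK", "MEMORY_ONLY_SER".
  static <T> JavaRDD<T> cache(JavaRDD<T> rdd, String userStorageLevel) {
    return rdd.persist(StorageLevel.fromString(userStorageLevel));
  }
}
{code}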
When the dataset is unbounded, which Spark represents as a DStream, we call DStream.cache(), which defaults to persisting the DStream at storage level MEMORY_ONLY_SER. (UnboundedDataset.java, DStream.scala)
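The unbounded path, by contrast, never consults the user's storage level. A minimal sketch of what happens:
{code:java}
import org.apache.spark.streaming.api.java.JavaDStream;

class UnboundedCachingSketch {
  // DStream.cache() persists at MEMORY_ONLY_SER; unlike the bounded
  // path, the storage level from SparkPipelineOptions is ignored here.
  static <T> JavaDStream<T> cache(JavaDStream<T> dstream) {
    return dstream.cache();
  }
}
{code}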
Storage level MEMORY_ONLY_SER means Spark will serialize the data using its configured serializer. Since we hard-code that serializer to Kryo, the data will always be serialized using Kryo. (SparkContextFactory.java)
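The hard-coding boils down to setting spark.serializer on the SparkConf. A sketch of that configuration step (the class and helper names other than Spark's own are illustrative):
{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoSerializer;

class SerializerConfigSketch {
  // Because the serializer is fixed here rather than exposed as a
  // pipeline option, users cannot switch back to Java serialization.
  static SparkConf withKryo(SparkConf conf) {
    return conf.set("spark.serializer", KryoSerializer.class.getName());
  }
}
{code}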
As a result, if your DStream contains data that Kryo cannot serialize, you will hit Kryo serialization exceptions and your tasks will fail.
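One common Spark-level mitigation, sketched below, is to register a Kryo serializer that falls back to Java serialization for the problematic type. ProblematicType is a hypothetical stand-in, and this is not necessarily the fix applied for this issue:
{code:java}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.serializers.JavaSerializer;
import org.apache.spark.serializer.KryoRegistrator;

public class FallbackRegistrator implements KryoRegistrator {

  // Hypothetical stand-in for a type Kryo cannot handle out of the box.
  public static class ProblematicType implements java.io.Serializable {}

  @Override
  public void registerClasses(Kryo kryo) {
    // Delegate this type to Java serialization instead of Kryo's default.
    kryo.register(ProblematicType.class, new JavaSerializer());
  }
}
{code}
This would be activated with conf.set("spark.kryo.registrator", FallbackRegistrator.class.getName()), assuming the runner lets that property through.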
Issue Links
- is related to: BEAM-2825 Improve readability of SparkGroupAlsoByWindowViaWindowSet (Resolved)
- links to