Details
- Type: Bug
- Status: Resolved
- Priority: P2
- Resolution: Fixed
- Affects Versions: 0.4.0, 0.5.0, 0.6.0, 2.0.0
- Fix Versions: None
Description
Today, when the Spark runner detects re-use of a dataset in a pipeline, it eagerly caches the dataset to avoid computing the same data multiple times. (EvaluationContext.java)
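Conceptually, the re-use detection amounts to counting consumers per dataset. The following is a purely illustrative sketch; CacheOnReuseTracker and its methods are hypothetical names, not the actual EvaluationContext API:
{code:java}
import java.util.HashMap;
import java.util.Map;

class CacheOnReuseTracker {
  // Number of downstream transforms consuming each dataset, keyed by a
  // dataset identifier (hypothetical bookkeeping).
  private final Map<String, Integer> consumers = new HashMap<>();

  void registerConsumer(String datasetId) {
    consumers.merge(datasetId, 1, Integer::sum);
  }

  // A dataset read by more than one transform is worth persisting so it
  // is not recomputed for each consumer.
  boolean shouldCache(String datasetId) {
    return consumers.getOrDefault(datasetId, 0) > 1;
  }
}
{code}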
When the dataset is bounded, which Spark represents as an RDD, we call RDD#persist with the storage level the user provides via SparkPipelineOptions. (BoundedDataset.java)
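In code, the bounded path amounts to roughly the following. This is a minimal sketch, assuming the storage level is carried as a String (as with SparkPipelineOptions#getStorageLevel) and parsed by Spark:
{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

class BoundedCachingSketch {
  // Persist the RDD at whatever level the user configured, e.g.
  // "MEMORY_ONLY", "MEMORY_AND_DISK", "MEMORY_ONLY_SER".
  static <T> JavaRDD<T> cache(JavaRDD<T> rdd, String userStorageLevel) {
    return rdd.persist(StorageLevel.fromString(userStorageLevel));
  }
}
{code}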
When the dataset is unbounded, which Spark represents as a DStream, we call DStream.cache(), which defaults to persisting the DStream at storage level MEMORY_ONLY_SER. (UnboundedDataset.java, DStream.scala)
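The unbounded path, by contrast, never consults the user's storage level. A minimal sketch of what happens:
{code:java}
import org.apache.spark.streaming.api.java.JavaDStream;

class UnboundedCachingSketch {
  // DStream.cache() persists at MEMORY_ONLY_SER; unlike the bounded
  // path, the storage level from SparkPipelineOptions is ignored here.
  static <T> JavaDStream<T> cache(JavaDStream<T> dstream) {
    return dstream.cache();
  }
}
{code}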
Storage level MEMORY_ONLY_SER means Spark will serialize the data using its configured serializer. Since we hard-code that serializer to Kryo, the data will always be serialized using Kryo. (SparkContextFactory.java)
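The hard-coding boils down to setting spark.serializer on the SparkConf. A sketch of that configuration step (the class and helper names other than Spark's own are illustrative):
{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoSerializer;

class SerializerConfigSketch {
  // Because the serializer is fixed here rather than exposed as a
  // pipeline option, users cannot switch back to Java serialization.
  static SparkConf withKryo(SparkConf conf) {
    return conf.set("spark.serializer", KryoSerializer.class.getName());
  }
}
{code}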
As a result, if your DStream contains data that Kryo cannot serialize, you will hit Kryo serialization exceptions and your tasks will fail.
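One common Spark-level mitigation, sketched below, is to register a Kryo serializer that falls back to Java serialization for the problematic type. ProblematicType is a hypothetical stand-in, and this is not necessarily the fix applied for this issue:
{code:java}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.serializers.JavaSerializer;
import org.apache.spark.serializer.KryoRegistrator;

public class FallbackRegistrator implements KryoRegistrator {

  // Hypothetical stand-in for a type Kryo cannot handle out of the box.
  public static class ProblematicType implements java.io.Serializable {}

  @Override
  public void registerClasses(Kryo kryo) {
    // Delegate this type to Java serialization instead of Kryo's default.
    kryo.register(ProblematicType.class, new JavaSerializer());
  }
}
{code}
This would be activated with conf.set("spark.kryo.registrator", FallbackRegistrator.class.getName()), assuming the runner lets that property through.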
Issue Links
- is related to: BEAM-2825 Improve readability of SparkGroupAlsoByWindowViaWindowSet (Resolved)
- links to