Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
2.4.5
-
None
-
None
-
GCP Dataproc 1.5 Debian 10 (Hadoop 2.10.0, Spark 2.4.5, Cloud Storage Connector hadoop2.2.1.3, Scala 2.12.10)
Description
Structured streaming checkpointing does not work with Google Cloud Storage when there are aggregations included in the streaming pipeline.
Using GCS as the external store works fine when there are no aggregations present in the pipeline (i.e. groupBy); however, once an aggregation is introduced, the attached error is thrown.
The error is only thrown when aggregating and pointing checkpointLocation to GCS. The exact code works fine when pointing checkpointLocation to HDFS.
Is it expected for GCS to function as a checkpoint location for aggregated pipelines? Are efforts currently in progress to enable this? Is it on a roadmap?