Spark / SPARK-31931

When using GCS as checkpoint location for Structured Streaming aggregation pipeline, the Spark writing job is aborted


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None
    • Environment: GCP Dataproc 1.5 Debian 10 (Hadoop 2.10.0, Spark 2.4.5, Cloud Storage Connector hadoop2.2.1.3, Scala 2.12.10)

    Description

      Structured streaming checkpointing does not work with Google Cloud Storage when there are aggregations included in the streaming pipeline.

      Using GCS as the external store works fine when there are no aggregations present in the pipeline; however, once an aggregation (e.g. groupBy) is introduced, the attached error is thrown.

      The error is thrown only when aggregating and pointing checkpointLocation at GCS. The exact same code works fine when checkpointLocation points to HDFS.

      Is it expected for GCS to function as a checkpoint location for aggregated pipelines? Are efforts currently in progress to enable this? Is it on a roadmap?
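      A minimal sketch of the failing shape described above, assuming a trivial rate source and a hypothetical gs:// bucket name (the reporter's actual source, schema, and paths are not given in this issue):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions._

      object GcsCheckpointRepro {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("gcs-checkpoint-repro")
            .getOrCreate()

          // Any streaming source will do; the built-in rate source keeps the repro small.
          val events = spark.readStream
            .format("rate")
            .option("rowsPerSecond", 10)
            .load()

          // Without this groupBy, checkpointing to GCS reportedly works;
          // with the aggregation present, the writing job is aborted.
          val counts = events
            .withColumn("bucket", col("value") % 10)
            .groupBy("bucket")
            .count()

          counts.writeStream
            .outputMode("complete")
            .format("console")
            // Pointing this at an hdfs:// path instead works fine.
            .option("checkpointLocation", "gs://my-bucket/checkpoints/repro") // hypothetical bucket
            .start()
            .awaitTermination()
        }
      }
      ```

      The aggregation matters because it introduces streaming state that must be checkpointed alongside the offset/commit logs, which is where the GCS-backed run diverges from the HDFS-backed one.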

       

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Adrian Jones (adrianjonesru)
            Votes: 0
            Watchers: 2
