Spark / SPARK-31931

When using GCS as checkpoint location for Structured Streaming aggregation pipeline, the Spark writing job is aborted


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None
    • Environment: GCP Dataproc 1.5 Debian 10 (Hadoop 2.10.0, Spark 2.4.5, Cloud Storage Connector hadoop2.2.1.3, Scala 2.12.10)

    Description

      Structured streaming checkpointing does not work with Google Cloud Storage when there are aggregations included in the streaming pipeline.

      Using GCS as the external store works fine when there are no aggregations present in the pipeline; however, once an aggregation (e.g. groupBy) is introduced, the attached error is thrown.

      The error is thrown only when aggregating and pointing checkpointLocation at GCS. The exact same code works fine when checkpointLocation points to HDFS.

      Is it expected for GCS to function as a checkpoint location for aggregated pipelines? Are efforts currently in progress to enable this? Is it on a roadmap?
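      A minimal sketch of the failing shape described above, assuming a trivial rate source and a hypothetical gs:// bucket name (the reporter's actual source, schema, and paths are not given in this issue):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions._

      object GcsCheckpointRepro {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("gcs-checkpoint-repro")
            .getOrCreate()

          // Any streaming source will do; the built-in rate source keeps the repro small.
          val events = spark.readStream
            .format("rate")
            .option("rowsPerSecond", 10)
            .load()

          // Without this groupBy, checkpointing to GCS reportedly works;
          // with the aggregation present, the writing job is aborted.
          val counts = events
            .withColumn("bucket", col("value") % 10)
            .groupBy("bucket")
            .count()

          counts.writeStream
            .outputMode("complete")
            .format("console")
            // Pointing this at an hdfs:// path instead works fine.
            .option("checkpointLocation", "gs://my-bucket/checkpoints/repro") // hypothetical bucket
            .start()
            .awaitTermination()
        }
      }
      ```

      The aggregation matters because it introduces streaming state that must be checkpointed alongside the offset/commit logs, which is where the GCS-backed run diverges from the HDFS-backed one.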

       

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Adrian Jones (adrianjonesru)
            Votes: 0
            Watchers: 2
