SPARK-45080

Kafka DSv2 streaming source implementation calls planInputPartitions 4 times per microbatch

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      I was tracing the method calls of a DSv2 streaming source and found that planInputPartitions is called 4 times per microbatch.

      It turned out that the repeated calls to planInputPartitions come from `DataSourceV2ScanExecBase.supportsColumnar`, even though that path goes through `MicroBatchScanExec.inputPartitions`, which is defined as lazy, so repeated evaluation should not happen.

      The behavior seems to be coupled with Catalyst, and it is very hard to figure out why it happens, but with SPARK-44505 we can at least fix this on a per-data-source basis.
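
To illustrate why a per-source fix can sidestep the problem, here is a minimal toy model in plain Python, not Spark code. It assumes the fix consists of the source declaring its columnar support up front through the `Scan.columnarSupportMode()` hook from SPARK-44505, so that answering `supportsColumnar` no longer requires planning partitions. `ToyStream`, `ToyScanExec` and `planning_calls_for` are hypothetical names invented for this sketch. Only the three mode names are meant to mirror the Spark enum.

```python
from enum import Enum


class ColumnarSupportMode(Enum):
    # Names are intended to mirror Scan.ColumnarSupportMode (SPARK-44505).
    PARTITION_DEFINED = "partition_defined"
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"


class ToyStream:
    """Hypothetical stand-in for a micro-batch stream; counts planning calls."""

    def __init__(self):
        self.plan_calls = 0

    def plan_input_partitions(self):
        self.plan_calls += 1
        return ["partition-0", "partition-1"]


class ToyScanExec:
    """Hypothetical stand-in for the scan node that answers supportsColumnar."""

    def __init__(self, stream, mode):
        self.stream = stream
        self.mode = mode

    @staticmethod
    def partition_is_columnar(partition):
        # Every partition in this toy is row-based.
        return False

    def supports_columnar(self):
        if self.mode is ColumnarSupportMode.SUPPORTED:
            return True
        if self.mode is ColumnarSupportMode.UNSUPPORTED:
            return False
        # PARTITION_DEFINED: the answer depends on the partitions,
        # so they have to be planned before the check can be answered.
        partitions = self.stream.plan_input_partitions()
        return any(self.partition_is_columnar(p) for p in partitions)


def planning_calls_for(mode):
    """Return how many planning calls a single columnar check triggers."""
    stream = ToyStream()
    ToyScanExec(stream, mode).supports_columnar()
    return stream.plan_calls


if __name__ == "__main__":
    for mode in ColumnarSupportMode:
        print(mode.name, planning_calls_for(mode))
```

In this sketch only the `PARTITION_DEFINED` mode touches `plan_input_partitions` during the check. The toy says nothing about why Spark evaluates the check several times per microbatch. It only shows that a source which states its mode explicitly does not pay a planning cost for each evaluation.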

          People

            Assignee: Jungtaek Lim (kabhwan)
            Reporter: Jungtaek Lim (kabhwan)