SPARK-18475

Be able to provide higher parallelization for StructuredStreaming Kafka Source

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.2, 2.1.0
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None

    Description

      Right now, the Structured Streaming Kafka Source creates exactly one Spark task per Kafka TopicPartition that it reads from.
      This works poorly under data skew, because all records of a heavily loaded TopicPartition are processed by a single task. There is no reason why we shouldn't be able to increase parallelism further, i.e. have multiple Spark tasks read from the same Kafka TopicPartition.

      This means that we won't be able to use the "CachedKafkaConsumer" for its intended purpose (being cached) in this use case, but the extra overhead is worth it to handle data skew and increase parallelism, especially in ETL use cases.
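
      The proposal above can be illustrated with a minimal sketch in plain Python. It is not Spark code: `OffsetRange`, `split_range`, `plan_tasks` and the `max_per_task` parameter are hypothetical names chosen for this example, and the sketch assumes each batch is described by a half-open offset range per TopicPartition. It only shows how one skewed range could be cut into several contiguous sub-ranges, each of which could be handed to its own task.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Half-open range [start, end) of offsets in one Kafka TopicPartition.

    Hypothetical type for this sketch, not a Spark or Kafka class.
    """
    topic: str
    partition: int
    start: int
    end: int

    @property
    def size(self) -> int:
        return self.end - self.start


def split_range(r: OffsetRange, max_per_task: int) -> List[OffsetRange]:
    """Cut one range into contiguous chunks of at most max_per_task offsets."""
    if max_per_task <= 0:
        raise ValueError("max_per_task must be positive")
    chunks = []
    start = r.start
    while start < r.end:
        end = min(start + max_per_task, r.end)
        chunks.append(OffsetRange(r.topic, r.partition, start, end))
        start = end
    return chunks


def plan_tasks(ranges: List[OffsetRange], max_per_task: int) -> List[OffsetRange]:
    """Return one OffsetRange per task for the whole batch."""
    return [chunk for r in ranges for chunk in split_range(r, max_per_task)]


# A skewed batch: partition 0 holds ten times as many records as partition 1.
batch = [OffsetRange("events", 0, 0, 1000), OffsetRange("events", 1, 0, 100)]

# Current behaviour as described above: one task per TopicPartition.
one_task_per_partition = plan_tasks(batch, max_per_task=10**9)

# Proposed behaviour: several tasks may read from the same TopicPartition.
split_plan = plan_tasks(batch, max_per_task=250)

for task in split_plan:
    print(task)
```

      In this sketch, every chunk other than the first starts in the middle of a TopicPartition. A task reading such a chunk would have to position a consumer at that offset instead of continuing from where a cached consumer left off, which is presumably the extra overhead the description refers to.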

          People

            Assignee: Unassigned
            Reporter: Burak Yavuz (brkyvz)
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated:
              Resolved: