
[SPARK-20287] Kafka Consumer should be able to subscribe to more than one topic partition


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None

    Description

      As I understand it, and as it stands, one Kafka consumer is created for each topic partition of the source Kafka topics, and those consumers are cached.

      cf https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48

      In my opinion, this design is an anti-pattern for Kafka and highly inefficient:

      • Each Kafka consumer opens its own connection to the Kafka cluster.
      • Spark doesn't leverage a core strength of the Kafka consumer: Kafka automatically assigns and balances partitions amongst all the consumers that share the same group.id (see the sketch just after this list).
      • A Kafka consumer can still be cached even when it is assigned multiple partitions.
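
      To illustrate the second point, here is a minimal sketch against the plain Kafka 0.10 consumer API (Scala; the broker address, topic names and group.id are assumptions made only for the example), showing a single consumer covering several topics, and therefore many partitions, through group-based assignment:

      {code:scala}
      import java.util.Properties

      import scala.collection.JavaConverters._

      import org.apache.kafka.clients.consumer.KafkaConsumer

      object SingleConsumerSketch {
        def main(args: Array[String]): Unit = {
          val props = new Properties()
          // Broker address and group.id below are illustrative assumptions.
          props.put("bootstrap.servers", "localhost:9092")
          props.put("group.id", "my-streaming-app")
          props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer")
          props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer")

          val consumer = new KafkaConsumer[String, String](props)

          // One consumer subscribes to whole topics; Kafka assigns and rebalances
          // all of their partitions across every consumer sharing the same group.id.
          // (Alternatively, consumer.assign(...) pins chosen TopicPartitions to this
          // single consumer, which can still cover many partitions at once.)
          consumer.subscribe(List("events", "metrics").asJava)

          val records = consumer.poll(500) // Kafka 0.10-era poll(timeoutMs)
          records.asScala.foreach { r =>
            println(s"${r.topic()}-${r.partition()}@${r.offset()}: ${r.value()}")
          }
          consumer.close()
        }
      }
      {code}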

      I'm not sure how that translates to Spark's underlying RDD architecture, but from a Kafka standpoint I believe creating one consumer per partition is a big overhead, and a risk, as the user may have to increase the spark.streaming.kafka.consumer.cache.maxCapacity parameter (see the sketch below).
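
      For reference, raising that cache setting would look roughly like the following (a minimal sketch; the value 256 is only an illustrative assumption, not a recommendation):

      {code:scala}
      import org.apache.spark.SparkConf

      object CacheCapacitySketch {
        // Hypothetical workaround: raise the cached-consumer capacity so that one
        // consumer per topic partition still fits in the cache. The value 256 is
        // an assumed figure; the real requirement depends on the partition count.
        val conf: SparkConf = new SparkConf()
          .setAppName("kafka-streaming-job")
          .set("spark.streaming.kafka.consumer.cache.maxCapacity", "256")
      }
      {code}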

      Happy to discuss in order to understand the rationale.

    Attachments

    Activity

          People

            Unassigned Unassigned
            stephane.maarek@gmail.com Stephane Maarek
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

    Dates

      Created:
      Updated:
      Resolved: