Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.1.0
Fix Version/s: None
Component/s: None
Description
As I understand it, and as it stands, one Kafka consumer is created for each topic partition of the source Kafka topics, and these consumers are cached.
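For illustration, here is a minimal sketch of that pattern using the plain Kafka client API, not Spark's actual implementation: one consumer is created per TopicPartition via assign() and kept in a cache. The cache keying, broker address, and object names are my assumptions.

```scala
import java.util.{Collections, Properties}
import scala.collection.mutable
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object PerPartitionConsumerCache {
  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker address
  props.put("key.deserializer", classOf[StringDeserializer].getName)
  props.put("value.deserializer", classOf[StringDeserializer].getName)

  // One consumer per TopicPartition, kept alive in a cache:
  // every entry maintains its own broker connections.
  private val cache = mutable.Map.empty[TopicPartition, KafkaConsumer[String, String]]

  def consumerFor(topic: String, partition: Int): KafkaConsumer[String, String] = {
    val tp = new TopicPartition(topic, partition)
    cache.getOrElseUpdate(tp, {
      val consumer = new KafkaConsumer[String, String](props)
      consumer.assign(Collections.singletonList(tp)) // manual assignment, no group coordination
      consumer
    })
  }
}
```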
In my opinion, this design is an anti-pattern for Kafka and highly inefficient:
- Each Kafka consumer opens its own connections to the Kafka brokers, so the connection count scales with the number of partitions.
- Spark doesn't leverage the real power of Kafka consumers: Kafka automatically assigns and rebalances partitions among all the consumers that share the same group.id (see the sketch after this list).
- A Kafka consumer can still be cached even when it is assigned multiple partitions.
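By contrast, a single subscribe()-based consumer lets Kafka's group coordinator handle partition placement. A minimal sketch, assuming a broker on localhost:9092 and topic/group names chosen purely for illustration:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

object GroupManagedConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("group.id", "example-group")           // hypothetical group id
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    // subscribe() (unlike assign()) hands partition placement to the
    // group coordinator: all partitions of the topic are spread across
    // every consumer sharing this group.id and rebalanced automatically
    // as members join or leave.
    consumer.subscribe(Collections.singletonList("example-topic")) // hypothetical topic

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.forEach { r =>
        println(s"partition=${r.partition} offset=${r.offset}")
      }
    }
  }
}
```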
I'm not sure how this translates to Spark's underlying RDD architecture, but from a Kafka standpoint I believe creating one consumer per partition adds significant overhead, and it is a risk: users may have to increase the spark.streaming.kafka.consumer.cache.maxCapacity parameter.
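For reference, a sketch of raising that cache ceiling (the property name comes from the spark-streaming-kafka-0-10 integration; the value 128 is just an example, not a recommendation):

```scala
import org.apache.spark.SparkConf

// Raise the per-executor consumer cache capacity so it can hold
// more cached consumers than the default allows.
val conf = new SparkConf()
  .setAppName("KafkaCacheExample")
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")
```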
Happy to discuss this to understand the rationale behind the current design.