Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
- Affects Versions: 2.0.2, 2.1.0
Description
Currently the Structured Streaming Kafka source creates exactly one Spark task per Kafka TopicPartition it reads from.
This works poorly under data skew, and there is no fundamental reason parallelism could not be increased further, i.e. multiple Spark tasks reading from the same Kafka TopicPartition.
The trade-off is that the CachedKafkaConsumer can no longer serve its intended purpose (being cached) in this scenario, but the extra overhead is worth it for handling data skew and increasing parallelism, especially in ETL use cases.
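The core idea can be sketched as offset-range splitting: instead of one task covering a partition's whole offset range, divide the range into several contiguous sub-ranges, each becoming its own task. The sketch below is purely illustrative — the function name and tuple layout are hypothetical, not Spark's actual internal API.

```python
# Hypothetical sketch: splitting one Kafka TopicPartition's offset range
# into several contiguous sub-ranges so multiple Spark tasks can read it.

def split_offset_range(topic, partition, start, end, num_splits):
    """Divide the half-open offset range [start, end) into at most
    num_splits contiguous sub-ranges of near-equal size."""
    total = end - start
    if total <= 0 or num_splits <= 1:
        return [(topic, partition, start, end)]
    base, rem = divmod(total, num_splits)
    ranges, offset = [], start
    for i in range(num_splits):
        # The first `rem` sub-ranges absorb one extra record each.
        size = base + (1 if i < rem else 0)
        if size == 0:
            continue
        ranges.append((topic, partition, offset, offset + size))
        offset += size
    return ranges

# A skewed partition holding 1,000,003 records becomes 4 tasks instead of 1:
print(split_offset_range("events", 7, 0, 1_000_003, 4))
```

Each resulting tuple would map to one Spark task; consumers for the split ranges cannot be reused across batches the way a single cached consumer per TopicPartition can, which is the overhead the description refers to.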