Details
- New Feature
- Status: Resolved
- Major
- Resolution: Fixed
- 2.3.0
- None
Description
Currently, when the Kafka source reads from Kafka, it generates as many tasks as there are partitions in the topic(s) being read. In some cases, it may be beneficial to read the data with greater parallelism, that is, with more partitions/tasks. To do that, the offset ranges must be divided into smaller ranges such that the number of records in each partition ~= total records in batch / desired partitions. This would also even out any data skew between topic-partitions.
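The splitting rule described above can be illustrated with a minimal sketch. This is not the Spark implementation: `OffsetRange`, `split_ranges`, and `desired_partitions` are hypothetical names chosen for illustration, and the sketch assumes each range is given as a half-open interval `[start, end)` of offsets. Each topic-partition receives a share of the desired partitions proportional to its record count, and sub-range boundaries are computed with integer arithmetic so that the pieces stay contiguous and cover every offset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical offset range for one topic-partition: [start, end)."""
    topic: str
    partition: int
    start: int  # inclusive
    end: int    # exclusive

    @property
    def size(self) -> int:
        return self.end - self.start


def split_ranges(ranges: List[OffsetRange], desired_partitions: int) -> List[OffsetRange]:
    """Split ranges so each piece holds roughly total / desired_partitions records."""
    non_empty = [r for r in ranges if r.size > 0]
    total = sum(r.size for r in non_empty)
    # Nothing to split if there are already enough partitions (or no data).
    if total == 0 or desired_partitions <= len(non_empty):
        return non_empty

    result: List[OffsetRange] = []
    for r in non_empty:
        # Share of the desired partitions proportional to this range's size,
        # rounded to nearest using integer math; at least 1, at most one per record.
        parts = (r.size * desired_partitions + total // 2) // total
        parts = max(1, min(parts, r.size))
        for i in range(parts):
            # Integer boundaries: piece i ends exactly where piece i + 1 begins.
            lo = r.start + r.size * i // parts
            hi = r.start + r.size * (i + 1) // parts
            result.append(OffsetRange(r.topic, r.partition, lo, hi))
    return result


# Example: a skewed topic with 900 records in partition 0 and 100 in partition 1.
skewed = [OffsetRange("t", 0, 0, 900), OffsetRange("t", 1, 0, 100)]
pieces = split_ranges(skewed, desired_partitions=10)
```

In this example the larger topic-partition is divided into nine sub-ranges and the smaller one is left whole, so the skew between the two disappears. Because of rounding, the number of pieces returned is only approximately `desired_partitions` in general.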
Issue Links
- causes
  - SPARK-28489 KafkaOffsetRangeCalculator.getRanges may drop offsets (Resolved)
- is duplicated by
  - SPARK-29799 Split a kafka partition into multiple KafkaRDD partitions in the kafka external plugin for Spark Streaming (Resolved)
- is related to
  - SPARK-28464 Document kafka minPartitions option in "Structured Streaming + Kafka Integration Guide" (Resolved)