[SPARK-23541] Allow Kafka source to read data with greater parallelism than the number of topic-partitions - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.4.0
Component/s: Structured Streaming
Labels: None
Target Version/s: 2.4.0

Description

Currently, when the Kafka source reads from Kafka, it generates as many tasks as there are partitions in the topic(s) being read. In some cases, it may be beneficial to read the data with greater parallelism, that is, with a larger number of partitions/tasks. This means that offset ranges must be divided into smaller ranges such that the number of records per partition ~= total records in batch / desired partitions. This would also balance out any data skew between topic-partitions.
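
The splitting described above can be illustrated with a small standalone sketch. This is not Spark's implementation: the names `OffsetRange` and `split_ranges` and the proportional rounding rule are assumptions made for illustration. It gives each topic-partition a share of tasks proportional to its share of records, then cuts its offset range into contiguous pieces so that no offset is dropped or read twice.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical offset range for one topic-partition: [from_offset, until_offset)."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive

    @property
    def size(self) -> int:
        return self.until_offset - self.from_offset


def split_ranges(ranges: List[OffsetRange], desired_partitions: int) -> List[OffsetRange]:
    """Split ranges so each piece holds roughly total / desired_partitions records."""
    non_empty = [r for r in ranges if r.size > 0]
    total = sum(r.size for r in non_empty)
    # Nothing to split if there is no data or we already have enough ranges.
    if total == 0 or desired_partitions <= len(non_empty):
        return non_empty

    out: List[OffsetRange] = []
    for r in non_empty:
        # Share of tasks proportional to share of records (rounded to nearest),
        # at least 1 and never more pieces than records.
        parts = max(1, (r.size * desired_partitions + total // 2) // total)
        parts = min(parts, r.size)
        for i in range(parts):
            # Integer boundaries: piece i ends exactly where piece i + 1 starts,
            # and the last piece ends at until_offset.
            start = r.from_offset + r.size * i // parts
            end = r.from_offset + r.size * (i + 1) // parts
            out.append(OffsetRange(r.topic, r.partition, start, end))
    return out
```

For example, with one topic-partition holding offsets [0, 900) and another holding [0, 100), `split_ranges(ranges, 10)` returns ten ranges of 100 records each, which also evens out the skew between the two topic-partitions. Because of rounding, the number of pieces is only approximately the requested number in general.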

Issue Links

causes: SPARK-28489 KafkaOffsetRangeCalculator.getRanges may drop offsets (Resolved)
is duplicated by: SPARK-29799 Split a kafka partition into multiple KafkaRDD partitions in the kafka external plugin for Spark Streaming (Resolved)
is related to: SPARK-28464 Document kafka minPartitions option in "Structured Streaming + Kafka Integration Guide" (Resolved)
links to: GitHub Pull Request #20698 (tdas)

People

Assignee: Tathagata Das
Reporter: Tathagata Das
Votes: 0
Watchers: 2

Dates

Created: 01/Mar/18 01:27
Updated: 13/Apr/20 03:27
Resolved: 03/Mar/18 02:14