Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
- Affects Versions: 2.0.2, 2.1.0
Description
Currently the Structured Streaming Kafka source creates exactly one Spark task per Kafka TopicPartition it reads from.
This works poorly under data skew, and there is no fundamental reason parallelism could not be increased further, i.e. multiple Spark tasks reading from the same Kafka TopicPartition.
The trade-off is that the CachedKafkaConsumer can no longer serve its intended purpose (being cached) in this scenario, but the extra overhead is worth it for handling data skew and increasing parallelism, especially in ETL use cases.
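The core idea can be sketched as offset-range splitting: instead of one task covering a partition's whole offset range, divide the range into several contiguous sub-ranges, each becoming its own task. The sketch below is purely illustrative — the function name and tuple layout are hypothetical, not Spark's actual internal API.

```python
# Hypothetical sketch: splitting one Kafka TopicPartition's offset range
# into several contiguous sub-ranges so multiple Spark tasks can read it.

def split_offset_range(topic, partition, start, end, num_splits):
    """Divide the half-open offset range [start, end) into at most
    num_splits contiguous sub-ranges of near-equal size."""
    total = end - start
    if total <= 0 or num_splits <= 1:
        return [(topic, partition, start, end)]
    base, rem = divmod(total, num_splits)
    ranges, offset = [], start
    for i in range(num_splits):
        # The first `rem` sub-ranges absorb one extra record each.
        size = base + (1 if i < rem else 0)
        if size == 0:
            continue
        ranges.append((topic, partition, offset, offset + size))
        offset += size
    return ranges

# A skewed partition holding 1,000,003 records becomes 4 tasks instead of 1:
print(split_offset_range("events", 7, 0, 1_000_003, 4))
```

Each resulting tuple would map to one Spark task; consumers for the split ranges cannot be reused across batches the way a single cached consumer per TopicPartition can, which is the overhead the description refers to.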