[SPARK-23623] Avoid concurrent use of cached KafkaConsumer in CachedKafkaConsumer (kafka-0-10-sql) - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.1, 2.4.0
Component/s: Structured Streaming
Labels:
None

Description

CacheKafkaConsumer in the project `kafka-0-10-sql` is designed to maintain a pool of KafkaConsumers that can be reused. However, it was built with the assumption there will be only one task using trying to read the same Kafka TopicPartition at the same time. Hence, the cache was keyed by the TopicPartition a consumer is supposed to read. And any cases where this assumption may not be true, we have SparkPlan flag to disable the use of a cache. So it was up to the planner to correctly identify when it was not safe to use the cache and set the flag accordingly.

Fundamentally, this is the wrong way to approach the problem. It is HARD for a high-level planner to reason about the low-level execution model, whether there will be multiple tasks in the same query trying to read the same partition. Case in point, 2.3.0 introduced stream-stream joins, and you can build a streaming self-join query on Kafka. It's pretty non-trivial to figure out how this leads to two tasks reading the same partition twice, possibly concurrently. And due to the non-triviality, it is hard to figure this out in the planner and set the flag to avoid the cache / consumer pool. And this can inadvertently lead to ConcurrentModificationException ,or worse, silent reading of incorrect data.

Here is a better way to design this. The planner shouldnt have to understand these low-level optimizations. Rather the consumer pool should be smart enough avoid concurrent use of a cached consumer. Currently, it tries to do so but incorrectly (the flag inuse is not checked when returning a cached consumer, see this If there is another request for the same partition as a currently in-use consumer, the pool should automatically return a fresh consumer that should be closed when the task is done.

Attachments

Issue Links

causes

SPARK-24987 Kafka Cached Consumer Leaking File Descriptors

Resolved

is duplicated by

SPARK-20919 Simplification of CachedKafkaConsumer.

Resolved

relates to

SPARK-23714 Add metrics for cached KafkaConsumer

Closed

links to

[Github] Pull Request #20767 (tdas)

[Github] Pull Request #20848 (tdas)

Activity

People

Assignee:: Tathagata Das

Reporter:: Tathagata Das

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 08/Mar/18 02:23

Updated:: 24/Aug/18 22:35

Resolved:: 16/Mar/18 18:11