Description
After upgrading a fairly busy broker from 0.10.2.0 to 2.1.0, it locked up within a few minutes (by "locked up" I mean that all request handler threads were busy and other brokers reported that they couldn't communicate with it). I restarted it a few times and it did the same thing each time; after downgrading to 0.10.2.0, the broker was stable. I have attached threaddump.txt from the last attempt on 2.1.0, which shows many kafka-request-handler-* threads trying to acquire the leaderIsrUpdateLock in kafka.cluster.Partition.
Two threads jump out: kafka-request-handler-1 and kafka-request-handler-4 each already hold some read lock (the dump doesn't show which) and are trying to acquire a second one, on two different read locks (0x0000000708184b88 and 0x000000070821f188). Both are handling a produce request and, in the process of doing so, are calling Partition.fetchOffsetSnapshot while trying to complete a DelayedFetch. At the same time, both of those locks have writers from other threads waiting on them (kafka-request-handler-2 and kafka-scheduler-6). Neither lock appears to be held by a writer (if only because no thread in the dump is deep enough inside inWriteLock to indicate that).
ReentrantReadWriteLock in non-fair mode blocks an incoming reader when a writer is already queued. Is it possible that kafka-request-handler-1 and kafka-request-handler-4 are each trying to read-lock the partition that is currently read-locked by the other, and that both are parked waiting for kafka-request-handler-2 and kafka-scheduler-6 to get their write locks, which they never will, because the former two threads hold read locks and aren't giving them up?
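To illustrate the suspected interleaving, here is a minimal, self-contained sketch (not Kafka code): two non-fair ReentrantReadWriteLocks stand in for the leaderIsrUpdateLock of two different partitions, and the thread names mirror the ones in the dump purely for readability. The timing via sleeps is contrived, but the ordering is the one described above.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical repro of the suspected deadlock: two readers each hold one
// read lock, writers queue on both locks, then each reader tries to take
// the other lock's read lock and parks behind the queued writer.
public class ReadLockDeadlockSketch {
    static final ReentrantReadWriteLock lockA = new ReentrantReadWriteLock(); // "partition A"
    static final ReentrantReadWriteLock lockB = new ReentrantReadWriteLock(); // "partition B"

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        // "kafka-request-handler-1": holds A's read lock, later wants B's read lock.
        Thread handler1 = new Thread(() -> {
            lockA.readLock().lock();
            sleep(200);               // let the writers queue up first
            lockB.readLock().lock();  // parks: a writer is already queued on B
        }, "handler-1");

        // "kafka-request-handler-4": holds B's read lock, later wants A's read lock.
        Thread handler4 = new Thread(() -> {
            lockB.readLock().lock();
            sleep(200);
            lockA.readLock().lock();  // parks: a writer is already queued on A
        }, "handler-4");

        handler1.start();
        handler4.start();
        sleep(100);                   // both read locks are now held

        // "kafka-request-handler-2" / "kafka-scheduler-6": writers that queue
        // behind the held read locks and never acquire them.
        new Thread(() -> lockA.writeLock().lock(), "writer-A").start();
        new Thread(() -> lockB.writeLock().lock(), "writer-B").start();

        // All four threads are now permanently parked: each reader waits behind
        // a queued writer, and each writer waits for a reader that will never
        // release its lock.
    }
}
```

If this is what is happening in the broker, every other request handler that subsequently touches either partition piles up behind the same locks, which would match the exhausted request handler pool described above.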
Attachments
- threaddump.txt
Issue Links
- duplicates
  - KAFKA-7870 Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was disconnected before the response was read. (Open)
  - KAFKA-8537 Kafka issues after 2.1.0 upgrade: java.net.SocketTimeoutException: Failed to connect within 30000 ms (Open)
  - KAFKA-7802 Connection to Broker Disconnected Taking Down the Whole Cluster (Open)
  - KAFKA-7757 Too many open files after java.io.IOException: Connection to n was disconnected before the response was read (Open)
  - KAFKA-7913 Kafka broker halts and messes up the whole cluster (Open)
  - KAFKA-7876 Broker suddenly got disconnected (Resolved)
- links to