[KAFKA-17142] Fix deadlock caused by LogManagerTest#testLogRecoveryMetrics - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.8.0
Component/s: None
Labels:
None

Description

`testLogRecoveryMetrics` uses mock scheduler to create UnifiedLog [0], and the mock scheduler does NOT have true thread poll, and that is the root cause!

We create recovery threads [1] for each data folder, and they will pass the action: `writeToFileForTruncation` to scheduler [2]. The action requires the read lock [3], so the deadlock is produced when one thread executes the action from another thread. For example:

1. thread_a is handling dir_a, and it holds the writelock_a
2. thread_a pass action_a: `writeToFileForTruncation` to mock scheduler
3. thread_b is handling dir_b, and it holds the writelock_b
4. thread_b pass action_b: `writeToFileForTruncation` to mock scheduler
5. thread_b (holding writelock_b) handle the action_a, so it requires the readlock_a
6. thread_a (holding writelock_a) handle the action_b, so it requires the readlock_b

so lucky we have a deadlock :cry

This deadlock happens due to the mock scheduler, so that is a issue belonging to test. We can fix it by a simple solution - add some delay when creating next UnifiedLog

Or we can do a bit refactor to production code: create snapshot of `epochs` when we are holding write lock! That means `writeToFileForTruncation` does not take lock anymore. For example:

    public void truncateFromStartAsyncFlush(long startOffset) {
        lock.writeLock().lock();
        try {
            List<EpochEntry> removedEntries = truncateFromStart(epochs, startOffset);
            if (!removedEntries.isEmpty()) {
                ...
                List<EpochEntry> entries = new ArrayList<>(epochs.values());
                scheduler.scheduleOnce("leader-epoch-cache-flush-" + topicPartition, () -> checkpoint.writeForTruncation(entries));
                ...
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

[0] https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/core/src/test/scala/unit/kafka/log/LogManagerTest.scala#L968
[1] https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/core/src/main/scala/kafka/log/LogManager.scala#L434
[2] https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/storage/src/main/java/org/apache/kafka/storage/internals/epoch/LeaderEpochFileCache.java#L351
[3] https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/storage/src/main/java/org/apache/kafka/storage/internals/epoch/LeaderEpochFileCache.java#L528

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

log
16/Jul/24 07:09
41 kB
Chia-Ping Tsai

Issue Links

links to

GitHub Pull Request #16614

Activity

People

Assignee:: PoAn Yang

Reporter:: Chia-Ping Tsai

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 16/Jul/24 07:02

Updated:: 22/Jul/24 17:25

Resolved:: 18/Jul/24 10:41