Description
We see a deadlock when a streams thread takes longer than MAX_POLL_INTERVAL_MS_CONFIG to process a task. In this case the thread's partitions are reassigned to some other thread, which also takes over the RocksDB lock. When the original thread tries to process the next task, it cannot get the RocksDB lock and simply keeps waiting for that lock forever.
In retryWithBackoff for AbstractTaskCreator we have backoffTimeMs = 50L.
If it does not get the lock, we simply increase the time by 10x and keep trying inside the while (true) loop.
We need an upper bound for this backoffTimeMs. If the time is greater than MAX_POLL_INTERVAL_MS_CONFIG and the thread still hasn't got the lock, it means this thread's partitions have been moved somewhere else and it may never get the lock again.
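A minimal sketch of what such a bound could look like, not the actual Kafka Streams code. The class BoundedBackoffRetry, the method retryWithBoundedBackoff, and its parameters are hypothetical names chosen for illustration; tryAcquire stands in for the attempt to take the state directory lock, and maxWaitMs stands in for the value configured under MAX_POLL_INTERVAL_MS_CONFIG.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Kafka Streams.
class BoundedBackoffRetry {

    /**
     * Calls tryAcquire until it succeeds, sleeping between attempts with a
     * backoff that grows 10x each round. Gives up once the total time slept
     * reaches maxWaitMs, instead of looping forever.
     *
     * @return true if tryAcquire succeeded, false if the bound was hit
     *         or the thread was interrupted while sleeping
     */
    static boolean retryWithBoundedBackoff(BooleanSupplier tryAcquire,
                                           long initialBackoffMs,
                                           long maxWaitMs) {
        long backoffMs = initialBackoffMs;
        long waitedMs = 0L;
        while (true) {
            if (tryAcquire.getAsBoolean()) {
                return true;
            }
            if (waitedMs >= maxWaitMs) {
                // Assumption: past this point the partitions have likely
                // been reassigned, so the caller should stop waiting.
                return false;
            }
            // Never sleep past the overall bound.
            long sleepMs = Math.min(backoffMs, maxWaitMs - waitedMs);
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
            waitedMs += sleepMs;
            // Grow the backoff 10x, capped at maxWaitMs to avoid overflow.
            backoffMs = backoffMs > maxWaitMs / 10 ? maxWaitMs : backoffMs * 10;
        }
    }
}
```

With this shape, a caller that gets false back could treat the task as lost (for example, by skipping it and waiting for the next rebalance) rather than blocking on the lock indefinitely. What the thread should do after giving up is left open here.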