Thanks Arun Suresh for the comment and for confirming the fix direction!
I figured out what was missing in my previously mentioned approach: I was only removing the runnables from the executor's queue, but did not correctly handle the runnables that had already started.
On the suggestion to, in the drain call, do an executor.shutdownNow(), clear all the queues, and restart the executor threads: my thought is that shutting down all tasks may be overkill, since we only need to cancel the tasks for the key that is being rolled. Going that way, we would also need to handle the executorThreadsStarted race between submitRefillTask and the restart, as sketched below. What do you think about my approach that follows?
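For concreteness, here is roughly how I read that suggestion. This is a hypothetical sketch, not the attached patch; buildNewExecutor() is a made-up name and the declarations are simplified:
{code:java}
// Hypothetical sketch of the shutdownNow-based drain (NOT the attached patch).
public synchronized void drain(String keyName) {
  executor.shutdownNow();          // cancels refills for ALL keys, not just keyName
  keyQueues.invalidateAll();       // clears every cached key queue
  executor = buildNewExecutor();   // recreate the pool (made-up helper)
  // Race to handle: submitRefillTask() checks executorThreadsStarted before
  // prestarting core threads, so resetting the flag must be coordinated with
  // swapping in the new pool, or a concurrent submitRefillTask may see a
  // "started" flag that belongs to the old, dead executor.
  executorThreadsStarted = false;
}
{code}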
Attached patch 6; I ran the reproduction snippet locally 10k times with no failures.
- As discussed, this focuses on fixing the ValueQueue itself, instead of hacking it just for the test.
- In ValueQueue#drain, remove all NamedRunnables from the ThreadPoolExecutor's queue for the specific key being drained. To achieve this, the keysInProgress data structure (used for NamedRunnables) was changed from a set to a map (see the first sketch after this list).
- We also need to cancel any NamedRunnable that has already started executing.
- Since drain isn't the caller that submits to the executor, it has no Future it could cancel. I added an AtomicBoolean to flag whether a NamedRunnable has been canceled; the task then cancels its job / clears the wrongly-filled queue according to this flag (second sketch below).
- We need to handle the race on the keyQueue between drain and get. I added a ReadWriteLock for this purpose; each key uses a separate lock, so operations on different keys don't impact each other (third sketch below).
- Since the keyQueues are loaded lazily by the LoadingCache, I didn't find a perfect place to eagerly initialize the per-key locks, so they are created via lazy initialization.
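To make the list above concrete, here is a minimal sketch of the per-key removal in drain. The shape of keysInProgress (key name to its pending runnables) follows the bullet above; everything else is simplified and not necessarily the exact patch code:
{code:java}
// Sketch only: keysInProgress changed from Set<String> to a map of
// key name -> its queued/running refill runnables, so drain can target one key.
// (imports: java.util.*, java.util.concurrent.*)
private final ConcurrentMap<String, List<NamedRunnable>> keysInProgress =
    new ConcurrentHashMap<String, List<NamedRunnable>>();

public void drain(String keyName) {
  List<NamedRunnable> pending = keysInProgress.remove(keyName);
  if (pending != null) {
    for (NamedRunnable r : pending) {
      executor.remove(r);  // drops it from the pool's queue if not yet started
      r.cancel();          // flags it if it is already running (next sketch)
    }
  }
  keyQueues.invalidate(keyName);  // discard already-cached values for this key
}
{code}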
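The cancellation flag could look roughly like this; the AtomicBoolean is what the patch adds, while fillQueueForKey is a stand-in for the existing refill logic:
{code:java}
// Sketch of the cancellation flag on NamedRunnable (simplified shape).
// (imports: java.util.concurrent.atomic.AtomicBoolean)
private abstract static class NamedRunnable implements Runnable {
  final String keyName;
  private final AtomicBoolean canceled = new AtomicBoolean(false);

  private NamedRunnable(String keyName) {
    this.keyName = keyName;
  }

  void cancel() {
    canceled.set(true);
  }

  boolean isCanceled() {
    return canceled.get();
  }
}

// Inside the refill task submitted by submitRefillTask (simplified):
public void run() {
  if (isCanceled()) {
    return;                    // drained before we started: do nothing
  }
  fillQueueForKey(keyName);    // stand-in for the existing refill work
  if (isCanceled()) {
    // We raced with drain(): the values we just filled belong to the old
    // key version, so clear the wrongly-filled queue.
    keyQueues.invalidate(keyName);
  }
}
{code}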
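And a sketch of the per-key ReadWriteLock with lazy initialization, assuming Java 8's ConcurrentHashMap#computeIfAbsent (the patch may create the locks differently); doGetAtMost and cancelRefillsForKey are stand-ins for the existing get logic and the per-key removal above:
{code:java}
// Sketch of per-key locking with lazy lock creation.
// (imports: java.util.concurrent.*, java.util.concurrent.locks.*)
private final ConcurrentMap<String, ReadWriteLock> keyLocks =
    new ConcurrentHashMap<String, ReadWriteLock>();

private ReadWriteLock lockFor(String keyName) {
  // Created lazily because keyQueues entries themselves are created lazily
  // by the LoadingCache, so there is no single initialization point.
  return keyLocks.computeIfAbsent(keyName, k -> new ReentrantReadWriteLock());
}

public List<E> getAtMost(String keyName, int num)
    throws IOException, ExecutionException {
  Lock read = lockFor(keyName).readLock();   // gets on the same key may overlap
  read.lock();
  try {
    return doGetAtMost(keyName, num);        // stand-in for the existing get logic
  } finally {
    read.unlock();
  }
}

public void drain(String keyName) {
  Lock write = lockFor(keyName).writeLock(); // excludes concurrent gets on this key
  write.lock();
  try {
    cancelRefillsForKey(keyName);            // per-key removal sketched above
    keyQueues.invalidate(keyName);
  } finally {
    write.unlock();
  }
}
{code}
Using a read lock for get lets concurrent gets on the same key proceed (the underlying queue is already thread-safe), while drain's write lock blocks gets only for the key being rolled.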
I understand this patch adds locking and may impact performance, but I think that is the trade-off for supporting roll and get from the same client (and I feel we should support it). I've tried to keep the scope of the impact (locking / runnable canceling) per key, which matches how roll and get currently work.
Please review and share your thoughts. Thank you!