The reducer gets stuck in the copy phase and makes no progress for a very long time. After the task is killed manually a couple of times, it completes.
- Verified GC logs. Found no memory-related issues. The logs are attached.
- Verified thread dumps. Found no thread-related problems.
- The logs show that the fetcher threads are not copying map outputs; they are just waiting for a merge to happen.
- The merge thread is alive and in a wait state.
After careful examination of the logs, thread dumps, and code, this looks to me like a classic lost-notification problem in multi-threaded code: the merge thread goes into the wait state after the notification has already been sent, so nothing wakes it up.
Here is the suspect code flow.
Fetcher thread - sends the notification first
Merge thread - goes to wait state afterwards (the notification goes unconsumed)
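
To make the suspected flow concrete, here is a minimal sketch of the pattern, not the actual Hadoop code. The class `MergeSignal` and the methods `requestMerge`, `awaitUnguarded`, and `awaitGuarded` are hypothetical names invented for illustration. It assumes the fetcher and merge threads coordinate through plain `wait()`/`notifyAll()` on a shared monitor. The unguarded wait shows how a notification that arrives first is lost; the guarded wait shows one common remedy, which is to record the request in a flag and check that flag before and after waiting.

```java
public class LostNotificationSketch {

    /** Hypothetical stand-in for the merge coordination object; not Hadoop's actual class. */
    static final class MergeSignal {
        private final Object lock = new Object();
        private boolean mergeRequested = false; // remembers a notification that arrived early

        /** Fetcher side: record the request under the lock, then notify. */
        void requestMerge() {
            synchronized (lock) {
                mergeRequested = true;
                lock.notifyAll();
            }
        }

        /**
         * Suspect pattern: wait without checking any state first.
         * If requestMerge() already ran, its notifyAll() found no waiter and is lost,
         * so this blocks until the timeout (a timeout of 0 would block indefinitely).
         */
        void awaitUnguarded(long timeoutMillis) {
            synchronized (lock) {
                try {
                    lock.wait(timeoutMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        /**
         * Guarded pattern: check the flag before waiting and re-check it in a loop.
         * Returns true if a merge request was seen, false on timeout or interrupt.
         */
        boolean awaitGuarded(long timeoutMillis) {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            synchronized (lock) {
                while (!mergeRequested) {
                    long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remainingMillis <= 0) {
                        return false; // timed out without seeing a request
                    }
                    try {
                        lock.wait(remainingMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return false;
                    }
                }
                mergeRequested = false; // consume the request
                return true;
            }
        }
    }

    public static void main(String[] args) {
        MergeSignal signal = new MergeSignal();
        signal.requestMerge();                    // fetcher: notification comes first
        boolean seen = signal.awaitGuarded(1000); // merge thread: arrives later
        System.out.println("guarded wait saw early request: " + seen);
    }
}
```

With the guarded version, the order in which the two threads reach the monitor no longer matters, because the request is held in `mergeRequested` until the merge thread consumes it. Whether the real fix should take this shape depends on the actual merge manager code, which this sketch does not reproduce.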