Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Versions: 1.14.4, 1.15.1
Description
Scenario:
- CheckpointCoordinator - Completed checkpoint 14 for job 00000000000000000000000000000000
- HybridSource successfully finished processing a few SourceFactories that read from S3
- HybridSourceSplitEnumerator.switchEnumerator failed with com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out. This is an intermittent error; it usually goes away when Flink recovers from the checkpoint and repeats the operation.
- Flink starts recovering from the checkpoint
- HybridSourceSplitEnumerator receives SourceReaderFinishedEvent{sourceIndex=-1}
- Processing this event causes:
2022/08/08 08:39:34.862 ERROR o.a.f.r.s.c.SourceCoordinator - Uncaught exception in the SplitEnumerator for Source Source: hybrid-source while handling operator event SourceEventWrapper[SourceReaderFinishedEvent{sourceIndex=-1}] from subtask 6. Triggering job failover.
java.lang.NullPointerException: Source for index=0 is not available from sources: {788=org.apache.flink.connector.file.src.SppFileSource@5a3803f3}
at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:104)
at org.apache.flink.connector.base.source.hybridspp.SwitchedSources.sourceOf(SwitchedSources.java:36)
at org.apache.flink.connector.base.source.hybridspp.HybridSourceSplitEnumerator.sendSwitchSourceEvent(HybridSourceSplitEnumerator.java:152)
at org.apache.flink.connector.base.source.hybridspp.HybridSourceSplitEnumerator.handleSourceEvent(HybridSourceSplitEnumerator.java:226)
...
I'm running my own build of the Hybrid Source with additional logging, so line numbers and some names may differ from the Flink code on GitHub.
My observation: the problem is intermittent. Sometimes it works fine, i.e. SourceReaderFinishedEvent arrives with the correct sourceIndex. According to my log, the working case occurs when my SourceFactory.create() is executed BEFORE HybridSourceSplitEnumerator - handleSourceEvent SourceReaderFinishedEvent{sourceIndex=-1}.
If HybridSourceSplitEnumerator - handleSourceEvent is executed before my SourceFactory.create(), then SourceReaderFinishedEvent carries sourceIndex=-1.
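For illustration only, the shape of the exception in the stack trace can be reproduced in isolation. The sketch below assumes the source registry behaves like a plain map keyed by source index and guarded by a null check. The class name SourceLookupSketch, its methods, and the "file-source" placeholder are hypothetical and are not Flink's actual classes. It does not model the event ordering or the root cause; it only shows why a lookup for index 0 fails when the registry holds nothing but index 788.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for an index -> source registry; not Flink code.
public class SourceLookupSketch {
    private final Map<Integer, String> sources = new HashMap<>();

    // Registers a source under its index (assumed to happen when a factory creates it).
    public void register(int index, String source) {
        sources.put(index, source);
    }

    // Looks up a source and fails with a descriptive NullPointerException if it is missing.
    public String sourceOf(int index) {
        return Objects.requireNonNull(
                sources.get(index),
                "Source for index=" + index + " is not available from sources: " + sources);
    }

    public static void main(String[] args) {
        SourceLookupSketch registry = new SourceLookupSketch();
        registry.register(788, "file-source");

        // The registered index resolves normally.
        System.out.println(registry.sourceOf(788));

        // An index that was never registered triggers the null check.
        try {
            registry.sourceOf(0);
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, any code path that asks for index 0 while only index 788 is registered fails in the same way as the trace, regardless of which event triggered the lookup.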
The Preconditions.checkNotNull error log from the JobManager is attached.