Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-28817

NullPointerException in HybridSource when restoring from checkpoint

    XMLWordPrintableJSON

Details

    Description

      Scenario:

      1. CheckpointCoordinator - Completed checkpoint 14 for job 00000000000000000000000000000000
      2. HybridSource successfully completed processing a few SourceFactories, that reads from s3
      3. HybridSourceSplitEnumerator.switchEnumerator failed with com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out. This is intermittent error, it is usually fixed, when Flink recover from checkpoint & repeat the operation.
      4. Flink starts recovering from checkpoint, 
      5. HybridSourceSplitEnumerator receives SourceReaderFinishedEvent{sourceIndex=-1}
      6. Processing this event cause 

      2022/08/08 08:39:34.862 ERROR o.a.f.r.s.c.SourceCoordinator - Uncaught exception in the SplitEnumerator for Source Source: hybrid-source while handling operator event SourceEventWrapper[SourceReaderFinishedEvent

      {sourceIndex=-1}

      ] from subtask 6. Triggering job failover.
      java.lang.NullPointerException: Source for index=0 is not available from sources: {788=org.apache.flink.connector.file.src.SppFileSource@5a3803f3}
      at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:104)
      at org.apache.flink.connector.base.source.hybridspp.SwitchedSources.sourceOf(SwitchedSources.java:36)
      at org.apache.flink.connector.base.source.hybridspp.HybridSourceSplitEnumerator.sendSwitchSourceEvent(HybridSourceSplitEnumerator.java:152)
      at org.apache.flink.connector.base.source.hybridspp.HybridSourceSplitEnumerator.handleSourceEvent(HybridSourceSplitEnumerator.java:226)
      ...

      I'm running my version of the Hybrid Sources with additional logging, so line numbers & some names could be different from Flink Github.

      My Observation: the problem is intermittent, sometimes it works ok, i.e. SourceReaderFinishedEvent comes with correct sourceIndex. As I see from my log, it happens if my SourceFactory.create()  is executed BEFORE HybridSourceSplitEnumerator - handleSourceEvent SourceReaderFinishedEvent{sourceIndex=-1}.
      If  HybridSourceSplitEnumerator - handleSourceEvent is executed before my SourceFactory.create(), then sourceIndex=-1 in SourceReaderFinishedEvent

      Preconditions-checkNotNull-error log from JobMgr is attached

      Attachments

        1. Preconditions-checkNotNull-error.zip
          288 kB
          Michael
        2. bf-29-JM-err-analysis.log
          17 kB
          Michael

        Activity

          People

            zhongqishang Qishang Zhong
            Benenson Michael
            Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: