[FLINK-32754] Using SplitEnumeratorContext.metricGroup() in restoreEnumerator causes NPE - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Information Provided
Affects Version/s: 1.17.0, 1.17.1
Fix Version/s: None
Component/s: Runtime / Checkpointing
Labels: None

Description

We registered some metrics in the enumerator of a FLIP-27 source via `SplitEnumeratorContext.metricGroup()`, but found that the job prints NPE logs in the JM when restoring, which suggests that `SplitEnumeratorContext.metricGroup()` returns null.
Meanwhile, the job does not fail over, and checkpoints cannot be created even after the job reaches the running state.

We found that the underlying context implementation is `LazyInitializedCoordinatorContext`, whose `metricGroup()` is only initialized once `lazyInitialize()` has been called. By reviewing the code, we found that `lazyInitialize()` has not yet been called when `SourceCoordinator.resetToCheckpoint()` runs, so an NPE is thrown.
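The ordering problem can be illustrated with a minimal, self-contained sketch. It does not use Flink's classes: `LazyContext` and `Enumerator` are hypothetical stand-ins for the lazily initialized coordinator context and for an enumerator that registers a metric while it is being restored, and the metric group is reduced to a plain `String`.

```java
final class LazyContextSketch {

    /** Hypothetical stand-in for a context whose fields are only set by lazyInitialize(). */
    static final class LazyContext {
        private String metricGroup; // stays null until lazyInitialize() runs

        void lazyInitialize(String group) {
            this.metricGroup = group;
        }

        String metricGroup() {
            return metricGroup;
        }
    }

    /** Hypothetical enumerator that touches the metric group as soon as it is created. */
    static final class Enumerator {
        final String metricName;

        Enumerator(LazyContext context) {
            // Throws NullPointerException if the context has not been initialized yet.
            this.metricName = context.metricGroup().concat(".pendingSplits");
        }
    }

    /** Returns true if creating the enumerator fails with an NPE. */
    static boolean restoreThrowsNpe(boolean initializeFirst) {
        LazyContext context = new LazyContext();
        if (initializeFirst) {
            context.lazyInitialize("enumerator");
        }
        try {
            new Enumerator(context);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("restore before init throws NPE: " + restoreThrowsNpe(false));
        System.out.println("restore after init throws NPE: " + restoreThrowsNpe(true));
    }
}
```

In this sketch the failure depends only on whether initialization happens before or after the enumerator is created, which mirrors the ordering described in the report.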

Q: Why does this bug prevent the task from creating the Checkpoint?
`SourceCoordinator.resetToCheckpoint()` throws an NPE, which leaves the member variable `enumerator` in `SourceCoordinator` null. All checkpoint-related calls in `SourceCoordinator` go through `runInEventLoop()`.
In `runInEventLoop()`, if the enumerator is null, the call returns immediately without running the action, so no checkpoint is ever taken.
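A small sketch of this guard, under simplifying assumptions: the class below is hypothetical, runs the action inline instead of submitting it to an executor, and only models the "return if the enumerator is null" behavior described above.

```java
final class EventLoopGuardSketch {

    private Object enumerator; // stays null when the restore failed

    /** Simulates a restore; a failed restore leaves the enumerator unset. */
    void restore(boolean fail) {
        if (!fail) {
            enumerator = new Object();
        }
    }

    /** Runs the action only if an enumerator exists, otherwise returns silently. */
    void runInEventLoop(Runnable action) {
        if (enumerator == null) {
            return; // no log, no exception: the caller cannot tell the action was skipped
        }
        action.run();
    }

    /** Returns true if the checkpoint action actually ran. */
    boolean checkpoint() {
        boolean[] done = {false};
        runInEventLoop(() -> done[0] = true);
        return done[0];
    }

    public static void main(String[] args) {
        EventLoopGuardSketch coordinator = new EventLoopGuardSketch();
        coordinator.restore(true);
        System.out.println("checkpoint ran after failed restore: " + coordinator.checkpoint());
    }
}
```

Because the guard returns without reporting anything, every later checkpoint request is dropped in the same way.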

Q: Why doesn't this bug trigger a task failover?
In `RecreateOnResetOperatorCoordinator.resetAndStart()`, if `internalCoordinator.resetToCheckpoint` throws an exception, the exception is caught and `cleanAndFailJob` is called to try to fail the job.
However, `globalFailureHandler` is also initialized in `lazyInitialize()`, so `globalFailureHandler.handleGlobalFailure(e)` itself throws an NPE, and `schedulerExecutor.execute` ignores that NPE.
As a result, the job never fails over.
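How an exception can disappear inside an executor can be sketched with the JDK alone. This is not Flink's scheduler executor: the example assumes an executor that stores a task's exception in a future nobody inspects, which is what `ScheduledExecutorService.execute` does in `java.util.concurrent`. The `handler` parameter is a hypothetical stand-in for `globalFailureHandler`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

final class SwallowedFailureSketch {

    /** Tries to fail the job through the handler; returns true if the handler call completed. */
    static boolean failJob(Consumer<Throwable> handler) {
        AtomicBoolean jobFailed = new AtomicBoolean(false);
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        Throwable cause = new NullPointerException("metricGroup is null");

        executor.execute(() -> {
            // If handler is null this line throws an NPE inside the task.
            // The executor keeps it in an internal future that is never read.
            handler.accept(cause);
            jobFailed.set(true);
        });

        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return jobFailed.get();
    }

    public static void main(String[] args) {
        System.out.println("job failed with uninitialized handler: " + failJob(null));
        System.out.println("job failed with initialized handler: " + failJob(t -> { }));
    }
}
```

With an uninitialized handler the task dies quietly and the caller sees neither a failure nor an exception, which matches the observation that the job keeps running.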

Attachments

image-2023-08-04-18-28-05-897.png (1008 kB, Yu Chen, 04/Aug/23 10:28)

Issue Links

is duplicated by

FLINK-31268 OperatorCoordinator.Context#metricGroup will return null when restore from a savepoint

Resolved

People

Assignee: Unassigned
Reporter: Yu Chen
Votes: 0
Watchers: 4

Dates

Created: 04/Aug/23 10:53
Updated: 09/Aug/23 04:03
Resolved: 09/Aug/23 04:03