[FLINK-8487] State loss after multiple restart attempts - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.3.2
Fix Version/s: 1.3.3, 1.4.3, 1.5.0
Component/s: Runtime / State Backends
Labels: None

Description

A user reported this issue on the user@f.a.o mailing list and analyzed the situation.

Scenario:

The program reads from Kafka and computes counts in a keyed 15-minute tumbling window. The state backend is RocksDB and checkpointing is enabled.

keyBy(0)
        .timeWindow(Time.of(window_size, TimeUnit.MINUTES))
        .allowedLateness(Time.of(late_by, TimeUnit.SECONDS))
        .reduce(new ReduceFunction(), new WindowFunction())
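For readers unfamiliar with the windowing setup, here is a minimal plain-Java sketch (no Flink dependency) of how a keyed 15-minute tumbling window with allowed lateness could assign events and keep per-key counts. It assumes event-time semantics and a window offset of 0. The class name `TumblingWindowSketch`, the 30-second lateness value standing in for `late_by`, and the in-memory map are all hypothetical illustrations, not the user's program or Flink's implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone illustration; not Flink code.
public class TumblingWindowSketch {
    static final long WINDOW_MS = 15 * 60 * 1000L;       // 15-minute tumbling window
    static final long ALLOWED_LATENESS_MS = 30 * 1000L;  // assumed stand-in for late_by

    // key -> (window start -> count); this plays the role of the keyed window state.
    private final Map<String, Map<Long, Long>> counts = new HashMap<>();

    // Start of the tumbling window that contains the timestamp (offset 0 assumed).
    static long windowStart(long timestampMs) {
        return timestampMs - Math.floorMod(timestampMs, WINDOW_MS);
    }

    // A window is considered expired once the watermark has reached its
    // last timestamp plus the allowed lateness.
    static boolean isLate(long windowStartMs, long watermarkMs) {
        long maxTimestamp = windowStartMs + WINDOW_MS - 1;
        return maxTimestamp + ALLOWED_LATENESS_MS <= watermarkMs;
    }

    // Adds one event; returns false if the event's window has already expired.
    boolean add(String key, long timestampMs, long watermarkMs) {
        long start = windowStart(timestampMs);
        if (isLate(start, watermarkMs)) {
            return false;
        }
        counts.computeIfAbsent(key, k -> new HashMap<>()).merge(start, 1L, Long::sum);
        return true;
    }

    // Current count for a key in the window beginning at windowStartMs.
    long count(String key, long windowStartMs) {
        Map<Long, Long> empty = Collections.emptyMap();
        return counts.getOrDefault(key, empty).getOrDefault(windowStartMs, 0L);
    }

    public static void main(String[] args) {
        TumblingWindowSketch sketch = new TumblingWindowSketch();
        sketch.add("user-1", 16 * 60 * 1000L, 0L);  // falls into window [15 min, 30 min)
        sketch.add("user-1", 20 * 60 * 1000L, 0L);  // same window
        // Event for window [0, 15 min) arriving after that window has expired.
        boolean accepted = sketch.add("user-1", 60 * 1000L, 930_000L);
        System.out.println("count = " + sketch.count("user-1", 900_000L)); // count = 2
        System.out.println("late event accepted = " + accepted);           // false
    }
}
```

In this sketch the `counts` map corresponds to the partial aggregates that, in the real job, would live in the RocksDB state backend and be captured by checkpoints.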

At some point, HDFS went into safe mode due to NameNode issues, and the following exception was thrown:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby. Visit https://s.apache.org/sbnn-error
    ..................

    at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:453)
    at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.mkdirs(SafetyNetWrapperFileSystem.java:111)
    at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory.createBasePath(FsCheckpointStreamFactory.java:132)

After the HDFS issues were resolved, the pipeline came back following a few restarts and checkpoint failures.

It was evident that operator state was lost. Either the Kafka consumer kept advancing its offset between a start and the next checkpoint failure (a minute's worth), or the state of the operator that held the partial aggregates was lost.

The user did some in-depth analysis (see mail thread) and might have (according to aljoscha) identified the problem.

stefanrichter83@gmail.com, can you have a look at this issue and check if it is relevant?

Issue Links

links to

GitHub Pull Request #5654

GitHub Pull Request #5655

GitHub Pull Request #5656

People

Assignee: Aljoscha Krettek

Reporter: Fabian Hueske

Votes: 0

Watchers: 6

Dates

Created: 23/Jan/18 09:12

Updated: 11/Mar/18 15:44

Resolved: 11/Mar/18 15:44