FLINK-12381

Without HA, checkpointing crashes after a full restart


Details

    Description

      Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 'gs://example_bucket/flink/checkpoints/00000000000000000000000000000000/chk-16/_metadata' already exists
          at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
          at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
          at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
          at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
          at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
          at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
          at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
          at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
          at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
          at org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
          at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
          at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
          ... 8 more
      

      Instead, Flink should either overwrite the existing checkpoint or refuse to start the job at all. A partial failure with undefined behavior is not acceptable.
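The two acceptable behaviors can be sketched on a local filesystem with plain `java.nio.file`, used here as a stand-in for GCS. This is only an illustration, not Flink's code: the class and method names (`MetadataWritePolicies`, `writeNoOverwrite`, `writeOverwrite`, `failFastIfCheckpointsExist`) are hypothetical, and treating the current behavior as a create-if-absent write is an assumption drawn from the `FileAlreadyExistsException` in the trace.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; none of these names exist in Flink.
class MetadataWritePolicies {

    /** Create-if-absent write: assumed to mirror the behavior behind the trace. */
    static void writeNoOverwrite(Path metadata, byte[] data) throws IOException {
        Files.createDirectories(metadata.getParent());
        // CREATE_NEW throws FileAlreadyExistsException when the file is already there.
        Files.write(metadata, data, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    }

    /** Option 1: overwrite whatever an earlier run left behind. */
    static void writeOverwrite(Path metadata, byte[] data) throws IOException {
        Files.createDirectories(metadata.getParent());
        Files.write(metadata, data,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    }

    /** Option 2: refuse to start if the job's checkpoint directory already holds chk-* entries. */
    static void failFastIfCheckpointsExist(Path jobCheckpointDir) throws IOException {
        if (!Files.isDirectory(jobCheckpointDir)) {
            return; // nothing left over from an earlier run
        }
        try (DirectoryStream<Path> leftovers = Files.newDirectoryStream(jobCheckpointDir, "chk-*")) {
            if (leftovers.iterator().hasNext()) {
                throw new IllegalStateException(
                        "Checkpoint directory is not empty, refusing to start: " + jobCheckpointDir);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path jobDir = Files.createTempDirectory("checkpoints")
                .resolve("00000000000000000000000000000000");
        Path metadata = jobDir.resolve("chk-16").resolve("_metadata");
        byte[] first = "run-1".getBytes(StandardCharsets.UTF_8);
        byte[] second = "run-2".getBytes(StandardCharsets.UTF_8);

        writeNoOverwrite(metadata, first);          // first run succeeds
        try {
            writeNoOverwrite(metadata, second);     // restarted run hits the leftover file
        } catch (FileAlreadyExistsException e) {
            System.out.println("create-if-absent failed: " + e.getFile());
        }

        writeOverwrite(metadata, second);           // option 1: silently replaces the file
        try {
            failFastIfCheckpointsExist(jobDir);     // option 2: reject at startup instead
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Either option gives a defined outcome: the overwrite variant keeps the job running at the cost of discarding the old file, while the startup check surfaces the conflict before any checkpoint is attempted.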

       

      Repro:

      1. Set up a single-purpose job cluster (which could use much better docs btw!)
      2. Let it run for a while with the RocksDB state backend, checkpointing to GCS (gs://example)
      3. Kill it
      4. Start it
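The repro steps above can be imitated on a local filesystem to show one plausible way the collision arises. The sketch rests on assumptions the report does not state: that the job ID stays fixed (the all-zero ID seen in the path), that the checkpoint counter starts again from 1 after a restart without HA, and that only the newest checkpoint is retained. `RestartCollisionSketch` and `runJob` are hypothetical names, and local files stand in for GCS objects.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical simulation; it does not use or reproduce any Flink class.
class RestartCollisionSketch {

    // Fixed job ID, as seen in the path from the stack trace.
    static final String JOB_ID = "00000000000000000000000000000000";

    /**
     * Writes checkpoints 1..count, keeping only the newest one (assumption).
     * Returns the checkpoint ID whose _metadata already existed, or -1 if none did.
     */
    static long runJob(Path checkpointRoot, int count) throws IOException {
        Path jobDir = checkpointRoot.resolve(JOB_ID);
        Path previous = null;
        for (long id = 1; id <= count; id++) {   // counter restarts at 1 on every run (assumption)
            Path chkDir = jobDir.resolve("chk-" + id);
            Files.createDirectories(chkDir);
            Path metadata = chkDir.resolve("_metadata");
            try {
                // Create-if-absent write, analogous to the failing create() in the trace.
                Files.write(metadata, new byte[] {0},
                        StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            } catch (FileAlreadyExistsException e) {
                return id;
            }
            if (previous != null) {              // drop the subsumed checkpoint
                Files.delete(previous.resolve("_metadata"));
                Files.delete(previous);
            }
            previous = chkDir;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("checkpoints");
        System.out.println("first run collided at: " + runJob(root, 16));  // prints -1
        // "Kill it" and "Start it": same root, same job ID, counter back at 1.
        System.out.println("second run collided at: " + runJob(root, 20)); // prints 16
    }
}
```

Under these assumptions the restarted job completes checkpoints 1 through 15 without trouble and only fails when it reaches the ID of the checkpoint the first run left behind, which would match a failure at `chk-16` some time after startup rather than immediately.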

            People

              Assignee: Unassigned
              Reporter: Henrik (haf)