FLUME-3080: Close failure in HDFS Sink might cause data loss

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.7.0
    • Fix Version/s: 1.8.0
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      If the HDFS Sink tries to close a file but the close fails (e.g. due to a timeout), the last block might not end up in the COMPLETE state. In this case block recovery should happen, but as the lease is still held by Flume, the NameNode will start the recovery process only after the hard limit of one hour expires.
      Lease recovery can be started manually with the hdfs debug recoverLease command.
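      For example, targeting one of the files that later shows up as open for write in the fsck output below (the -retries count here is arbitrary):

      $ hdfs debug recoverLease -path /user/flume/test/FlumeData.1490887185312.avro -retries 3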

      To reproduce the issue, I removed the close call from the BucketWriter (https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java#L380) to simulate the failure and used the following config:

      agent.sources = testsrc
      agent.sinks = testsink
      agent.channels = testch
      
      agent.sources.testsrc.type = netcat
      agent.sources.testsrc.bind = localhost
      agent.sources.testsrc.port = 9494
      agent.sources.testsrc.channels = testch
      
      agent.sinks.testsink.type = hdfs
      agent.sinks.testsink.hdfs.path = /user/flume/test
      agent.sinks.testsink.hdfs.rollInterval = 0
      agent.sinks.testsink.hdfs.rollCount = 3
      agent.sinks.testsink.serializer = avro_event
      agent.sinks.testsink.serializer.compressionCodec = snappy
      agent.sinks.testsink.hdfs.fileSuffix = .avro
      agent.sinks.testsink.hdfs.fileType = DataStream
      agent.sinks.testsink.hdfs.batchSize = 2
      agent.sinks.testsink.hdfs.writeFormat = Text
      agent.sinks.testsink.hdfs.idleTimeout=20
      agent.sinks.testsink.channel = testch
      
      agent.channels.testch.type = memory
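
      The test events can then be sent to the netcat source from another terminal, for example:

      $ nc localhost 9494
      a
      b
      c
      d
      e
      f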
      

      After ingesting six events ("a" to "f"), two files were created on HDFS, as expected. But events are missing when listing the contents in the Spark shell:

      scala> sqlContext.read.avro("/user/flume/test/FlumeData.14908867127*.avro").collect().map(a => new String(a(1).asInstanceOf[Array[Byte]])).foreach(println)
      a
      b
      d
      

      hdfs fsck also confirms that the blocks are still in UNDER_CONSTRUCTION state:

      $ hdfs fsck /user/flume/test/ -openforwrite -files -blocks
      FSCK started by root (auth:SIMPLE) from /172.31.114.3 for path /user/flume/test/ at Thu Mar 30 08:23:56 PDT 2017
      /user/flume/test/ <dir>
      /user/flume/test/FlumeData.1490887185312.avro 310 bytes, 1 block(s), OPENFORWRITE:  MISSING 1 blocks of total size 310 B
      0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761923_21128{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW], ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW]]} len=310 MISSING!
      
      /user/flume/test/FlumeData.1490887185313.avro 292 bytes, 1 block(s), OPENFORWRITE:  MISSING 1 blocks of total size 292 B
      0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761924_21129{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW], ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW]]} len=292 MISSING!
      

      These blocks need to be recovered by starting a lease recovery process on the NameNode (which will then run the block recovery as well). This can be triggered programmatically via the DFSClient.
      Adding this call if the close fails solves the issue.
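      A minimal standalone sketch of that idea (not the actual BucketWriter change; the class name and the main() wrapper are made up here for illustration):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.hdfs.DistributedFileSystem;

      public class RecoverLeaseOnCloseFailure {

        // If close() fails, explicitly ask the NameNode to start lease recovery so the
        // last block can be completed without waiting for the one-hour hard limit.
        static void recoverLease(Configuration conf, Path path) throws Exception {
          FileSystem fs = path.getFileSystem(conf);
          if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // recoverLease() returns true if the file is already closed,
            // false if recovery has been started but is not yet complete.
            boolean closed = dfs.recoverLease(path);
            System.out.println("Lease recovery " + (closed ? "already complete" : "started")
                + " for " + path);
          }
        }

        public static void main(String[] args) throws Exception {
          recoverLease(new Configuration(), new Path(args[0]));
        }
      }

      In the committed patch the recoverLease() call is made from BucketWriter when the close fails (see the commit and pull request in the Activity section below).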

      cc Wei-Chiu Chuang


          Activity

          Hudson added a comment -

          UNSTABLE: Integrated in Jenkins build Flume-trunk-hbase-1 #243 (See https://builds.apache.org/job/Flume-trunk-hbase-1/243/)
          FLUME-3080. Close failure in HDFS Sink might cause data loss (bessbd: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=e5c3e6aa76cf2b0bb0838ff6dcd3853656bff704)

          • (edit) flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java
          • (edit) flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestHDFSEventSinkOnMiniCluster.java

          ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flume/pull/127

          ASF subversion and git services added a comment -

          Commit e5c3e6aa76cf2b0bb0838ff6dcd3853656bff704 in flume's branch refs/heads/trunk from Denes Arvay
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=e5c3e6a ]

          FLUME-3080. Close failure in HDFS Sink might cause data loss

          If the HDFS Sink tries to close a file but it fails (e.g. due to timeout) the last block might
          not end up in COMPLETE state. In this case block recovery should happen but as the lease is
          still held by Flume the NameNode will start the recovery process only after the hard limit of
          1 hour expires.

          This change adds an explicit recoverLease() call in case of close failure.

          This closes #127

          Reviewers: Hari Shreedharan

          (Denes Arvay via Bessenyei Balázs Donát)

          ASF GitHub Bot added a comment -

          GitHub user adenes opened a pull request:

          https://github.com/apache/flume/pull/127

          FLUME-3080. Call DistributedFileSystem.recoverLease() if close() fails to avoid lease leak

          If the HDFS Sink tries to close a file but it fails (e.g. due to timeout) the last block might not end up in COMPLETE state. In this case block recovery should happen but as the lease is still held by Flume the NameNode will start the recovery process only after the hard limit of 1 hour expires.

          This change adds an explicit recoverLease() call in case of close failure.

          For more details see https://issues.apache.org/jira/browse/FLUME-3080

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/adenes/flume FLUME-3080

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flume/pull/127.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #127


          commit 8cc9082c69ad0aea2e8cfa20e906261a6a417245
          Author: Denes Arvay <denes@cloudera.com>
          Date: 2017-04-03T15:27:19Z

          FLUME-3080. call DistributedFileSystem.recoverLease() if close() fails to avoid lease leak

          If the HDFS Sink tries to close a file but it fails (e.g. due to timeout) the last block might
          not end up in COMPLETE state. In this case block recovery should happen but as the lease is
          still held by Flume the NameNode will start the recovery process only after the hard limit of
          1 hour expires.

          This change adds an explicit recoverLease() call in case of close failure.



            People

            • Assignee:
              Denes Arvay
            • Reporter:
              Denes Arvay
            • Votes:
              0
            • Watchers:
              7
