FLUME-2245: HDFS files with errors unable to close

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.6.0
    • Component/s: None
    • Labels: None

      Description

      This is running on a snapshot of Flume-1.5 with the git hash 99db32ccd163daf9d7685f0e8485941701e1133d

      When a datanode goes unresponsive for a significant amount of time (for example, a big GC), an append failure occurs, followed by repeated timeouts in the log and a failure to close the stream. The relevant section of the logs is attached (starting where the errors first appear).

      The same log repeats periodically, consistently running into a TimeoutException.

      Restarting Flume (or presumably just the HDFSSink) solves the issue.

      Probable cause in comments

      1. FLUME-2245.patch
        0.9 kB
        Brock Noland
      2. flume.log.file
        81 kB
        Juhani Connolly
      3. flume.log.1133
        41 kB
        Juhani Connolly


          Activity

          Juhani Connolly added a comment -

          Relevant section of logs

          Juhani Connolly added a comment -

          grep for a specific path demonstrating the repeated failures on the same file

          Juhani Connolly added a comment -

          It appears to me that when an error occurs during an append, BucketWriter.close() attempts to call BucketWriter.flush(); this fails as well, so we never reach an attempt to actually close the backing HDFSWriter. As a result, isOpen remains true and the process repeats constantly.

          Upon examination of the code, the flush() seems entirely unnecessary, as the HDFSWriter.close() implementations flush and sync the backing buffer before closing it. Is there a reason for it being called separately, and outside the try/catch?

          Further, looking at HDFSDataStream: since we're going to be rolling back anyway, could we call closeHDFSOutputStream() regardless of whether the flush and sync succeed? So long as we throw the exception, it should be propagated and the rollback will occur.

          There may be some deeper consequences I'm missing here, due to only a passing familiarity with the HDFSSink code. I'll throw up a fix to Review Board and would appreciate hearing from someone more familiar with the HDFS streams.
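The failure mode described in this comment can be sketched as follows. This is a hypothetical, stripped-down shape, not the real Flume code; the class and member names (StuckWriterSketch, pipelineWedged, closeUnderlyingWriter) are illustrative only.

```java
// Stripped-down sketch of the pre-fix close() shape described above.
// Names loosely mirror BucketWriter; this is NOT the real Flume code.
class StuckWriterSketch {
    boolean isOpen = true;
    boolean pipelineWedged = true; // stands in for the unresponsive DataNode

    void flush() throws Exception {
        if (pipelineWedged) {
            // stands in for the repeated TimeoutException seen in the logs
            throw new Exception("flush timed out");
        }
    }

    void closeUnderlyingWriter() {
        // never reached while flush() keeps failing
        isOpen = false;
    }

    // flush() sits before the close attempt and outside any try/catch,
    // so a flush failure skips closeUnderlyingWriter() entirely:
    // isOpen stays true and the close is retried forever.
    void close() throws Exception {
        flush();
        closeUnderlyingWriter();
    }
}
```

With pipelineWedged set, every close() call throws before reaching closeUnderlyingWriter(), so isOpen never flips to false — matching the periodically repeating log entries in the attached logs.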

          Hari Shreedharan added a comment -

          This patch does not really need the changes in the HDFSDataStream and HDFSCompressedStream classes. We should just catch the exception thrown by the flush and try to close. If the close fails, it will get rescheduled anyway.
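The suggestion amounts to something like the following sketch (hypothetical names again, not the committed patch): catch the flush failure and fall through to the close attempt.

```java
// Sketch of the suggested fix: a failed flush must not prevent the
// close attempt. Names are illustrative; this is NOT the real Flume code.
class RecoveringWriterSketch {
    boolean isOpen = true;
    boolean pipelineWedged = true; // stands in for the unresponsive DataNode

    void flush() throws Exception {
        if (pipelineWedged) {
            throw new Exception("flush timed out");
        }
    }

    void closeUnderlyingWriter() {
        isOpen = false;
    }

    void close() {
        try {
            flush();
        } catch (Exception e) {
            // swallow (in practice: log) and fall through; if the close
            // itself fails, it gets rescheduled anyway
        }
        closeUnderlyingWriter();
        // isOpen is now false, so the sink can roll to a new file
    }
}
```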

          Hari Shreedharan added a comment (edited) -

          Juhani Connolly - Do you want to just do that one? If yes, please submit a new patch - I will commit it.

          Brock Noland added a comment -

          Hi,

          Yes, I was able to test the BucketWriter change (and only that), and I found it fixed this issue.

          Note: I used kill -STOP on the DN to reproduce.
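The reproduction technique mentioned here relies on SIGSTOP freezing the DataNode JVM so the write pipeline wedges. A minimal sketch, using a stand-in process rather than a real DataNode:

```shell
# Sketch of the SIGSTOP reproduction technique. On a real cluster the
# target would be the DataNode JVM; here a sleep process stands in.
sleep 300 &                    # stand-in for the DataNode process
TARGET_PID=$!
kill -STOP "$TARGET_PID"       # process enters the stopped state; I/O
                               # against a real DN would now time out
ps -o stat= -p "$TARGET_PID"   # state begins with "T" (stopped)
kill -CONT "$TARGET_PID"       # resume the process
kill "$TARGET_PID"             # clean up the stand-in
```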

          Hari Shreedharan added a comment -

          Brock - Just putting the flush in a try-catch?

          Brock Noland added a comment -

          Attached is the patch which fixed the issue for me.

          Brock Noland added a comment -

          Note that it's just the try/catch on the flush.

          Juhani Connolly added a comment -

          I've not been working much with flume recently so please feel free to go ahead with this. We've been running something like what I submitted earlier but it would be nice to get rid of that if you can fix it more cleanly.

          Hari Shreedharan added a comment -

          +1. I will run tests and commit this tomorrow.

          Hari Shreedharan added a comment -

          I am going to give both of you credit in the commit message, since both your patches made sense.

          ASF subversion and git services added a comment -

          Commit 33cdcf0d4e85e68e6df9e1ca4be729889d480246 in flume's branch refs/heads/trunk from Hari Shreedharan
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=33cdcf0 ]

          FLUME-2245. Pre-close flush failure can cause HDFS Sinks to not process events.

          (Juhani Connolly, Brock Noland via Hari Shreedharan)

          ASF subversion and git services added a comment -

          Commit 6dd12343322cf73acddee7c0c4b73e9b94f44ccc in flume's branch refs/heads/flume-1.6 from Hari Shreedharan
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=6dd1234 ]

          FLUME-2245. Pre-close flush failure can cause HDFS Sinks to not process events.

          (Juhani Connolly, Brock Noland via Hari Shreedharan)

          Hari Shreedharan added a comment -

          Committed. Thanks Juhani and Brock!

          Hudson added a comment -

          SUCCESS: Integrated in flume-trunk #635 (See https://builds.apache.org/job/flume-trunk/635/)
          FLUME-2245. Pre-close flush failure can cause HDFS Sinks to not process events. (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=33cdcf0d4e85e68e6df9e1ca4be729889d480246)

          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java

            People

            • Assignee: Brock Noland
            • Reporter: Juhani Connolly
            • Votes: 0
            • Watchers: 5
