Flume
  1. Flume
  2. FLUME-1238

Support active rolling of files created by HDFS Event Sink

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: v1.1.0
    • Fix Version/s: v1.2.0
    • Component/s: None
    • Labels:
      None

      Description

      The HDFS Event Sink uses lazy rolling for closing files that are being written to. This results in many files being open for longer than their expected roll-interval if they are not actively written to and can even last in open state until the sink shuts down.

      It will be preferable to have these files roll proactively rather than waiting for a write to trigger the roll.

      1. FLUME-1238-1.patch
        29 kB
        Mike Percy
      2. FLUME-1238-2.patch
        30 kB
        Mike Percy

        Activity

        Hide
        Juhani Connolly added a comment -

        I just ran into something similar to this too. On our test setup we're naming files by the hour. The last rolled file for each hour isn't getting closed properly, with the name still being xyz-timestamp.tmp(I guess it eventually will once we hit a file limit? I think there was some code floating about to limit the number of open files, closing the oldest as necessary)

        Any thoughts on how to decide when to roll? Add a configuration for a fileroller, allowing configuration for how long it waits till closing inactive files?

        Show
        Juhani Connolly added a comment - I just ran into something similar to this too. On our test setup we're naming files by the hour. The last rolled file for each hour isn't getting closed properly, with the name still being xyz-timestamp.tmp(I guess it eventually will once we hit a file limit? I think there was some code floating about to limit the number of open files, closing the oldest as necessary) Any thoughts on how to decide when to roll? Add a configuration for a fileroller, allowing configuration for how long it waits till closing inactive files?
        Hide
        Mike Percy added a comment -

        Seems like having a separate thread that watches the open files should do the trick. It can also keep track of rollInterval.

        I'll take a stab at this one.

        Show
        Mike Percy added a comment - Seems like having a separate thread that watches the open files should do the trick. It can also keep track of rollInterval. I'll take a stab at this one.
        Hide
        Juhani Connolly added a comment -

        Was having a look through the code myself, but go ahead. It looks like you could probably add in some logic to the WriterLinkedHashMap to keep tracking of last access time to a bucketwriter local to one place in code. It would involve some threading issues that would require some care, but it looks like a tractable solution that doesn't add logic all over the place.

        Show
        Juhani Connolly added a comment - Was having a look through the code myself, but go ahead. It looks like you could probably add in some logic to the WriterLinkedHashMap to keep tracking of last access time to a bucketwriter local to one place in code. It would involve some threading issues that would require some care, but it looks like a tractable solution that doesn't add logic all over the place.
        Hide
        Mike Percy added a comment -

        Thanks for letting me pick it up, I'm fresh off looking at this section of the code and I need a solution pretty urgently.

        Best,
        Mike

        Show
        Mike Percy added a comment - Thanks for letting me pick it up, I'm fresh off looking at this section of the code and I need a solution pretty urgently. Best, Mike
        Hide
        Mike Percy added a comment -

        Patch is attached.

        Description of changes:

        • Now actively closes HDFS files when the rollInterval occurs
        • Adds a configurable-size scheduled thread pool per HDFS sink for scheduling the file rolling
        • Moves knowledge of the principal and doAs() down one level into the BucketWriter
        • Required a little bit of refactoring of the public methods in BucketWriter to support this in a multi-threaded fashion
        • Updated unit test to verify that files are actively rolled
        • Updated user guide with docs for the configuration variable

        As well as:

        • Minor refactoring in BucketWriter to make more variables final and thus codify/enforce more invariants
        • Minor refactoring in HDFSEventSink to remove now-unused ProxyCallable
        • Minor refactoring in HDFSEventSink to remove unneeded parameters and methods related to call timeouts
        • Also added convenience functions to HDFSEventSink for time-limited append, flush, and close to make the code more readable
        Show
        Mike Percy added a comment - Patch is attached. Description of changes: Now actively closes HDFS files when the rollInterval occurs Adds a configurable-size scheduled thread pool per HDFS sink for scheduling the file rolling Moves knowledge of the principal and doAs() down one level into the BucketWriter Required a little bit of refactoring of the public methods in BucketWriter to support this in a multi-threaded fashion Updated unit test to verify that files are actively rolled Updated user guide with docs for the configuration variable As well as: Minor refactoring in BucketWriter to make more variables final and thus codify/enforce more invariants Minor refactoring in HDFSEventSink to remove now-unused ProxyCallable Minor refactoring in HDFSEventSink to remove unneeded parameters and methods related to call timeouts Also added convenience functions to HDFSEventSink for time-limited append, flush, and close to make the code more readable
        Hide
        Mike Percy added a comment -

        Review Board is down, so this patch is up for review in JIRA

        Show
        Mike Percy added a comment - Review Board is down, so this patch is up for review in JIRA
        Hide
        Arvind Prabhakar added a comment -

        Thanks for the patch Mike. I will take a look at it right away.

        Show
        Arvind Prabhakar added a comment - Thanks for the patch Mike. I will take a look at it right away.
        Hide
        Arvind Prabhakar added a comment -

        Changes look good Mike! The one thing I observed is that one of the tests has a race condition and fails at times - TestBucketWriter.testIntervalRoller(). On my machine it fails almost all the time. One nit about this test is that it uses time based calculations for asserts - which makes it hard to debug with step-in debugger.

        Show
        Arvind Prabhakar added a comment - Changes look good Mike! The one thing I observed is that one of the tests has a race condition and fails at times - TestBucketWriter.testIntervalRoller(). On my machine it fails almost all the time. One nit about this test is that it uses time based calculations for asserts - which makes it hard to debug with step-in debugger.
        Hide
        Mike Percy added a comment -

        Arvind, thanks for the review!

        Attached is an updated patch to make the interval test more deterministic.

        Show
        Mike Percy added a comment - Arvind, thanks for the review! Attached is an updated patch to make the interval test more deterministic.
        Hide
        Arvind Prabhakar added a comment -

        +1!

        Changes look good, tests run clean.

        Show
        Arvind Prabhakar added a comment - +1! Changes look good, tests run clean.
        Hide
        Arvind Prabhakar added a comment -

        Patch committed. Thanks Mike!

        Show
        Arvind Prabhakar added a comment - Patch committed. Thanks Mike!
        Hide
        Hudson added a comment -

        Integrated in flume-trunk #227 (See https://builds.apache.org/job/flume-trunk/227/)
        FLUME-1238. Support active rolling of files created by HDFS Event Sink.

        (Mike Percy via Arvind Prabhakar) (Revision 1347216)

        Result = SUCCESS
        arvind : http://svn.apache.org/viewvc/?view=rev&rev=1347216
        Files :

        • /incubator/flume/trunk/flume-ng-doc/sphinx/FlumeUserGuide.rst
        • /incubator/flume/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java
        • /incubator/flume/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java
        • /incubator/flume/trunk/flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/MockHDFSWriter.java
        • /incubator/flume/trunk/flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestBucketWriter.java
        Show
        Hudson added a comment - Integrated in flume-trunk #227 (See https://builds.apache.org/job/flume-trunk/227/ ) FLUME-1238 . Support active rolling of files created by HDFS Event Sink. (Mike Percy via Arvind Prabhakar) (Revision 1347216) Result = SUCCESS arvind : http://svn.apache.org/viewvc/?view=rev&rev=1347216 Files : /incubator/flume/trunk/flume-ng-doc/sphinx/FlumeUserGuide.rst /incubator/flume/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java /incubator/flume/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java /incubator/flume/trunk/flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/MockHDFSWriter.java /incubator/flume/trunk/flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestBucketWriter.java

          People

          • Assignee:
            Mike Percy
            Reporter:
            Arvind Prabhakar
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development