Uploaded image for project: 'Flume'
  1. Flume
  2. FLUME-2595

Add option to checkpoint on file channel shutdown

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.6.0
    • Component/s: File Channel
    • Labels:

      Description

      Restarting a File channel based agent can take very long, a few hours to evens days in some cases. The reason being the slow replay performance.

      By having the file channel checkpoint on shutdown, this long replay can avoided as there will be nothing to replay on startup.

      A small hit (a few seconds perhaps) will be observed on shutdown to perform the checkpoint.

      1. FLUME-2595.patch
        7 kB
        Roshan Naik
      2. FLUME-2595.v2.patch
        9 kB
        Roshan Naik
      3. FLUME-2595.v3.patch
        9 kB
        Roshan Naik
      4. FLUME-2595.v4.patch
        12 kB
        Roshan Naik
      5. FLUME-2595.v5.patch
        13 kB
        Hari Shreedharan

        Issue Links

          Activity

          Hide
          hudson Hudson added a comment -

          UNSTABLE: Integrated in Flume-trunk-hbase-98 #88 (See https://builds.apache.org/job/Flume-trunk-hbase-98/88/)
          FLUME-2595. Add option to checkpoint on file channel shutdown (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=be4ae294ca549648f785e7eea7564ee95112130a)

          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/Log.java
          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FileChannel.java
          • flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestLog.java
          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FileChannelConfiguration.java
          • flume-tools/src/test/java/org/apache/flume/tools/TestFileChannelIntegrityTool.java
          • flume-ng-doc/sphinx/FlumeUserGuide.rst
          Show
          hudson Hudson added a comment - UNSTABLE: Integrated in Flume-trunk-hbase-98 #88 (See https://builds.apache.org/job/Flume-trunk-hbase-98/88/ ) FLUME-2595 . Add option to checkpoint on file channel shutdown (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=be4ae294ca549648f785e7eea7564ee95112130a ) flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/Log.java flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FileChannel.java flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestLog.java flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FileChannelConfiguration.java flume-tools/src/test/java/org/apache/flume/tools/TestFileChannelIntegrityTool.java flume-ng-doc/sphinx/FlumeUserGuide.rst
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Committed! Thanks Roshan!

          Show
          hshreedharan Hari Shreedharan added a comment - Committed! Thanks Roshan!
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 2006a57478c484639ab343bbbf18bfe796d56172 in flume's branch refs/heads/flume-1.6 from Hari Shreedharan
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=2006a57 ]

          FLUME-2595. Add option to checkpoint on file channel shutdown

          (Roshan Naik via Hari)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 2006a57478c484639ab343bbbf18bfe796d56172 in flume's branch refs/heads/flume-1.6 from Hari Shreedharan [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=2006a57 ] FLUME-2595 . Add option to checkpoint on file channel shutdown (Roshan Naik via Hari)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit be4ae294ca549648f785e7eea7564ee95112130a in flume's branch refs/heads/trunk from Hari Shreedharan
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=be4ae29 ]

          FLUME-2595. Add option to checkpoint on file channel shutdown

          (Roshan Naik via Hari)

          Show
          jira-bot ASF subversion and git services added a comment - Commit be4ae294ca549648f785e7eea7564ee95112130a in flume's branch refs/heads/trunk from Hari Shreedharan [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=be4ae29 ] FLUME-2595 . Add option to checkpoint on file channel shutdown (Roshan Naik via Hari)
          Hide
          roshan_naik Roshan Naik added a comment -

          Just took a look at it. Looks fine. please go ahead.

          Show
          roshan_naik Roshan Naik added a comment - Just took a look at it. Looks fine. please go ahead.
          Hide
          hshreedharan Hari Shreedharan added a comment -

          TestFileChannelIntegrityTool.java

          Show
          hshreedharan Hari Shreedharan added a comment - TestFileChannelIntegrityTool.java
          Hide
          roshan_naik Roshan Naik added a comment -

          Hari Shreedharan, in which .java file was the change ?

          Show
          roshan_naik Roshan Naik added a comment - Hari Shreedharan , in which .java file was the change ?
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Roshan Naik - Please take a look at the change I made. I am going to commit this if you agree!

          Thanks!

          Show
          hshreedharan Hari Shreedharan added a comment - Roshan Naik - Please take a look at the change I made. I am going to commit this if you agree! Thanks!
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Ah, so the issue is that the withCheckpoint tests actually reuse the originalCheckpoint directory. So once one of the tests run, we update the checkpoint directory when we close the channel with the "new" state of the channel with all events taken out - so the next test sees an emptry channel. It is a test bug. I am going to fix it and upload a patch in a few mins.

          Show
          hshreedharan Hari Shreedharan added a comment - Ah, so the issue is that the withCheckpoint tests actually reuse the originalCheckpoint directory. So once one of the tests run, we update the checkpoint directory when we close the channel with the "new" state of the channel with all events taken out - so the next test sees an emptry channel. It is a test bug. I am going to fix it and upload a patch in a few mins.
          Hide
          roshan_naik Roshan Naik added a comment -

          Thats what i thought too but didnt know enough. I dont think the en/di-sabling of checkpoint on close really makes a diff ... could use some help here .. Running it independently works well. I spent a little time trying to figure it out but wasnt obvious.

          Show
          roshan_naik Roshan Naik added a comment - Thats what i thought too but didnt know enough. I dont think the en/di-sabling of checkpoint on close really makes a diff ... could use some help here .. Running it independently works well. I spent a little time trying to figure it out but wasnt obvious.
          Hide
          hshreedharan Hari Shreedharan added a comment -

          What concerns me is why is there a failure when the test has a checkpoint written out on close? The tool should not even be looking at the checkpoint - so whatever the state of the checkpoint is, it should not matter at all. Can you figure out why the test is failing if checkpointOnClose is true? We should figure that out and not disable the checkpointing on close.

          Show
          hshreedharan Hari Shreedharan added a comment - What concerns me is why is there a failure when the test has a checkpoint written out on close? The tool should not even be looking at the checkpoint - so whatever the state of the checkpoint is, it should not matter at all. Can you figure out why the test is failing if checkpointOnClose is true? We should figure that out and not disable the checkpointing on close.
          Hide
          roshan_naik Roshan Naik added a comment -

          Hari Shreedharan Uploading revised patch for fixing FileChannelIntegrityTool UT failure. The UT fix is not ideal. The failure seems to be caused due to side effects from another test case. Temporarily working around by reordering the tests.

          Show
          roshan_naik Roshan Naik added a comment - Hari Shreedharan Uploading revised patch for fixing FileChannelIntegrityTool UT failure. The UT fix is not ideal. The failure seems to be caused due to side effects from another test case. Temporarily working around by reordering the tests.
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Roshan Naik - FileChannelIntegrityTool tests still fail:

          Running org.apache.flume.tools.TestFileChannelIntegrityTool
          Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.604 sec <<< FAILURE!
          testFixCorruptRecordsWithCheckpoint(org.apache.flume.tools.TestFileChannelIntegrityTool)  Time elapsed: 148 sec  <<< FAILURE!
          java.lang.AssertionError: expected:<13> but was:<0>
          	at org.junit.Assert.fail(Assert.java:93)
          	at org.junit.Assert.failNotEquals(Assert.java:647)
          	at org.junit.Assert.assertEquals(Assert.java:128)
          	at org.junit.Assert.assertEquals(Assert.java:472)
          	at org.junit.Assert.assertEquals(Assert.java:456)
          	at org.apache.flume.tools.TestFileChannelIntegrityTool.doTestFixCorruptEvents(TestFileChannelIntegrityTool.java:220)
          	at org.apache.flume.tools.TestFileChannelIntegrityTool.testFixCorruptRecordsWithCheckpoint(TestFileChannelIntegrityTool.java:145)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          	at java.lang.reflect.Method.invoke(Method.java:606)
          	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
          	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
          	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
          	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
          	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
          	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
          	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
          	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
          	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
          	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
          	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
          	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
          	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
          	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
          	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
          	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
          	at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
          	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
          	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
          	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          	at java.lang.reflect.Method.invoke(Method.java:606)
          	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
          	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
          	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
          	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
          	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
          
          
          
          Show
          hshreedharan Hari Shreedharan added a comment - Roshan Naik - FileChannelIntegrityTool tests still fail: Running org.apache.flume.tools.TestFileChannelIntegrityTool Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.604 sec <<< FAILURE! testFixCorruptRecordsWithCheckpoint(org.apache.flume.tools.TestFileChannelIntegrityTool) Time elapsed: 148 sec <<< FAILURE! java.lang.AssertionError: expected:<13> but was:<0> at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.junit.Assert.assertEquals(Assert.java:456) at org.apache.flume.tools.TestFileChannelIntegrityTool.doTestFixCorruptEvents(TestFileChannelIntegrityTool.java:220) at org.apache.flume.tools.TestFileChannelIntegrityTool.testFixCorruptRecordsWithCheckpoint(TestFileChannelIntegrityTool.java:145) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189) at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165) at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
          Hide
          roshan_naik Roshan Naik added a comment -

          Hari Shreedharan this patch fixes all UT failures in TestLog. Got a clean run on all FC UTs as well.

          Show
          roshan_naik Roshan Naik added a comment - Hari Shreedharan this patch fixes all UT failures in TestLog. Got a clean run on all FC UTs as well.
          Hide
          roshan_naik Roshan Naik added a comment -

          Ok. Will try to look into it today.

          Show
          roshan_naik Roshan Naik added a comment - Ok. Will try to look into it today.
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Roshan Naik - This fails the FileChannelIntegrityTool tests. Can you take a look?

          Show
          hshreedharan Hari Shreedharan added a comment - Roshan Naik - This fails the FileChannelIntegrityTool tests. Can you take a look?
          Hide
          hshreedharan Hari Shreedharan added a comment -

          There is a bug in TestLog that causes testPutCommit to fail. This is because the tests expect that the queue gets reloaded from the log files since there was no checkpoint being written earlier, so in addition to log.commitPut, the events are not being added to the eventQueue (which is done by the FileBackedTransaction in normal operation).

          I fixed this and I am committing this.

          So, +1.

          Show
          hshreedharan Hari Shreedharan added a comment - There is a bug in TestLog that causes testPutCommit to fail. This is because the tests expect that the queue gets reloaded from the log files since there was no checkpoint being written earlier, so in addition to log.commitPut, the events are not being added to the eventQueue (which is done by the FileBackedTransaction in normal operation). I fixed this and I am committing this. So, +1.
          Hide
          roshan_naik Roshan Naik added a comment -

          Revised patch:

          • locking first
          • added unit test
          Show
          roshan_naik Roshan Naik added a comment - Revised patch: locking first added unit test
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Yep.

          Show
          hshreedharan Hari Shreedharan added a comment - Yep.
          Hide
          roshan_naik Roshan Naik added a comment -

          Hari Shreedharan
          Sounds like the order of things you are suggesting is as follows?

          • grab lock
          • set open=false
          • checkpoint
          • proceed to close

          -roshan

          Show
          roshan_naik Roshan Naik added a comment - Hari Shreedharan Sounds like the order of things you are suggesting is as follows? grab lock set open=false checkpoint proceed to close -roshan
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Looks like the patch was not updated based on the comments. If you do so, I can commit this to 1.6. Also, I think we should grab the exclusive lock, then set open = false.

          Show
          hshreedharan Hari Shreedharan added a comment - Looks like the patch was not updated based on the comments. If you do so, I can commit this to 1.6. Also, I think we should grab the exclusive lock, then set open = false.
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Yes sir. That is right.

          Also please add a unit test.

          Show
          hshreedharan Hari Shreedharan added a comment - Yes sir. That is right. Also please add a unit test.
          Hide
          roshan_naik Roshan Naik added a comment - - edited

          Hari Shreedharan

          So ur saying basically .. during FC close

          • set the open=false first
          • grab the exclusive lock
          • initiate checkpoint (wont block due to re-entrant lock)
          • proceed with closing ..

          is that right ?

          Show
          roshan_naik Roshan Naik added a comment - - edited Hari Shreedharan So ur saying basically .. during FC close set the open=false first grab the exclusive lock initiate checkpoint (wont block due to re-entrant lock) proceed with closing .. is that right ?
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Roshan Naik - I posted a review on RB, just one comment.

          Show
          hshreedharan Hari Shreedharan added a comment - Roshan Naik - I posted a review on RB, just one comment.
          Hide
          roshan_naik Roshan Naik added a comment -

          Uploading patch

          Show
          roshan_naik Roshan Naik added a comment - Uploading patch
          Hide
          roshan_naik Roshan Naik added a comment -

          indeed. that was my intent also.

          Show
          roshan_naik Roshan Naik added a comment - indeed. that was my intent also.
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Btw, I think we should make this behavior default and have an option to disable it. The shutdown might be slightly slower, but I think that is fine.

          Show
          hshreedharan Hari Shreedharan added a comment - Btw, I think we should make this behavior default and have an option to disable it. The shutdown might be slightly slower, but I think that is fine.
          Hide
          roshan_naik Roshan Naik added a comment - - edited

          Yes i was planning to address this before attacking FLUME-2581

          Show
          roshan_naik Roshan Naik added a comment - - edited Yes i was planning to address this before attacking FLUME-2581
          Hide
          hshreedharan Hari Shreedharan added a comment -

          To be fair, I have not seen replay speed of more than a few minutes after FLUME-2155 (which removed a specific edge case that cause the hours-long replay in some cases), but did cause the normal case to become worse - which was fixed by FLUME-2450. So the current replay speed should be in the seconds-range.

          Of course, doing a checkpoint would definitely improve performance on startup.

          Roshan Naik - Are you planning to work on this one?

          Show
          hshreedharan Hari Shreedharan added a comment - To be fair, I have not seen replay speed of more than a few minutes after FLUME-2155 (which removed a specific edge case that cause the hours-long replay in some cases), but did cause the normal case to become worse - which was fixed by FLUME-2450 . So the current replay speed should be in the seconds-range. Of course, doing a checkpoint would definitely improve performance on startup. Roshan Naik - Are you planning to work on this one?

            People

            • Assignee:
              roshan_naik Roshan Naik
              Reporter:
              roshan_naik Roshan Naik
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development