FLUME-1929: CheckpointRebuilder main method does not work

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: None
    • Labels:
      None

      Description

      Based on the discussion in this thread: http://apache.markmail.org/thread/567cshrmz35okrq3 - the main method in CheckpointRebuilder was not updated for the new data file format.

      1. FLUME-1929.patch
        1.0 kB
        Hari Shreedharan
      2. cp-rebuild-stack.log
        6 kB
        Juhani Connolly
      3. FLUME-1929-1.patch
        2 kB
        Hari Shreedharan

        Activity

        juhanic Juhani Connolly added a comment -

        This appears to hang.

        Steps followed:

        • start up flume, feed some data, then kill -9 to try to force an inconsistent checkpoint
        • delete in-use.lock, checkpoint and checkpoint.meta
        • run the checkpoint rebuilder; the final command through our script is (note that I patched -c to become -h)

        + exec /usr/local/java/bin/java -server -XX:OnOutOfMemoryError=/tmp/stop.sh -XX:MaxPermSize=24m -XX:PermSize=24m -XX:SurvivorRatio=8 -Xmn96m -Xmx512m -Xms128m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=12345 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.rmi.server.hostname=172.28.202.76 -Dflume.monitoring.type=GANGLIA -Dflume.monitoring.hosts=pat-log-om01:8649 -cp '/etc/flume/conf:/usr/lib/flume/lib/*' -Djava.library.path= org.apache.flume.channel.file.CheckpointRebuilder -h /tmp/flume-check -l /tmp/flume-data -t 5000000

        Full logs are as below:

        27 Feb 2013 17:51:35,995 INFO [main] (org.apache.flume.channel.file.EventQueueBackingStoreFile.<init>:71) - Preallocated /tmp/flume-check/checkpoint to 40008232 for capacity 5000000
        27 Feb 2013 17:51:36,004 INFO [main] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:47) - Starting up with /tmp/flume-check/checkpoint and /tmp/flume-check/checkpoint.meta
        27 Feb 2013 17:51:36,078 INFO [main] (org.apache.flume.channel.file.CheckpointRebuilder.rebuild:64) - Attempting to fast replay the log files.
        27 Feb 2013 17:51:36,112 INFO [main] (org.apache.flume.tools.DirectMemoryUtils.getDefaultDirectMemorySize:113) - Unable to get maxDirectMemory from VM: NoSuchMethodException: sun.misc.VM.maxDirectMemory(null)
        27 Feb 2013 17:51:36,117 INFO [main] (org.apache.flume.tools.DirectMemoryUtils.allocate:47) - Direct Memory Allocation: Allocation = 1048576, Allocated = 0, MaxDirectMemorySize = 526843904, Remaining = 526843904
        27 Feb 2013 17:51:36,866 INFO [main] (org.apache.flume.channel.file.LogFile$SequentialReader.next:491) - Encountered EOF at 150457 in /tmp/flume-data/log-3
        27 Feb 2013 17:51:36,884 INFO [main] (org.apache.flume.channel.file.LogFile$SequentialReader.next:491) - Encountered EOF at 4095 in /tmp/flume-data/log-4
        27 Feb 2013 17:51:36,887 INFO [main] (org.apache.flume.channel.file.CheckpointRebuilder.rebuild:151) - Replayed 0 events using fast replay logic.
        27 Feb 2013 17:51:36,889 INFO [main] (org.apache.flume.channel.file.EventQueueBackingStoreFile.beginCheckpoint:108) - Start checkpoint for /tmp/flume-check/checkpoint, elements to sync = 0
        27 Feb 2013 17:51:36,896 INFO [main] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:120) - Updating checkpoint metadata: logWriteOrderID: 1361955096886, queueSize: 0, queueHead: 0
        27 Feb 2013 17:51:36,906 INFO [main] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-3.meta currentPosition = 0, logWriteOrderID = 1361955096886
        27 Feb 2013 17:51:36,908 INFO [main] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-4.meta currentPosition = 4095, logWriteOrderID = 1361955096886

        Some diagnostics:

        1. lsof +d /tmp/flume-data
          COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
          bash 15144 juhani_connolly cwd DIR 252,0 4096 132605 /tmp/flume-data
          sudo 16392 root cwd DIR 252,0 4096 132605 /tmp/flume-data
          lsof 16394 root cwd DIR 252,0 4096 132605 /tmp/flume-data
          lsof 16395 root cwd DIR 252,0 4096 132605 /tmp/flume-data

        Attaching thread dump

        juhanic Juhani Connolly added a comment -

        stack dump when trying to rebuild via CheckpointRebuilder.main()

        hshreedharan Hari Shreedharan added a comment -

        Looks like the checkpoint rebuilder is done. Not sure why the jvm did not exit.
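
A JVM stays up after main() returns for as long as any non-daemon thread is alive, so one way to narrow down a hang like this is to list those threads. The following is a minimal sketch and not part of Flume; the class and method names are hypothetical and it uses only standard java.lang APIs.

```java
import java.util.ArrayList;
import java.util.List;

public class NonDaemonThreadSketch {
  // Hypothetical helper: returns the names of live, non-daemon threads.
  // Any such thread other than the caller can keep the JVM running
  // after main() has returned.
  static List<String> liveNonDaemonThreadNames() {
    List<String> names = new ArrayList<>();
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      if (t.isAlive() && !t.isDaemon()) {
        names.add(t.getName());
      }
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println("Non-daemon threads still alive: " + liveNonDaemonThreadNames());
  }
}
```

Calling such a helper at the end of a main method gives roughly the same information as the attached thread dump, restricted to the threads that can block JVM exit.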

        hshreedharan Hari Shreedharan added a comment -

        Are you using latest trunk?

        juhanic Juhani Connolly added a comment - - edited

        Not quite, a few patches behind. I'll try building against trunk just in case.

        edit: looks like we were only one patch behind. The build I was testing was based on 102c5e07dec17740866315d342afc00c19267569 (up to FLUME-1765). hdfs-sink shouldn't be in any way related.

        hshreedharan Hari Shreedharan added a comment -

        After applying FLUME-1930 and this new patch, the JVM should shut down. Even without it, the checkpoint is actually generated; the JVM just keeps waiting because the executors do not terminate.
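
The general behaviour described here can be reproduced outside Flume. The sketch below is an illustration only, not the code from either patch: the class name is hypothetical, and it assumes a plain java.util.concurrent thread pool, whose worker threads are non-daemon by default.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorShutdownSketch {
  // Runs one trivial task on a pool, then shuts the pool down.
  // Returns true if all pool threads finished within the timeout.
  static boolean runAndShutDown() {
    // Worker threads from the default thread factory are non-daemon.
    ExecutorService pool = Executors.newFixedThreadPool(2);
    pool.submit(() -> System.out.println("task finished"));
    // Without this call the idle workers stay alive and the JVM does not exit.
    pool.shutdown();
    try {
      return pool.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("pool terminated: " + runAndShutDown());
  }
}
```

Removing the shutdown() call from this sketch leaves the process running after main() returns, even though the task itself has completed, which matches the symptom reported above.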

        hshreedharan Hari Shreedharan added a comment -

        Also for running fast replay using checkpoint rebuilder - you should probably give a lot more memory than -Xmx512m -Xms128m. This will read all events into memory and then write them out.

        juhanic Juhani Connolly added a comment -

        Thanks Hari!

        I'm actually against a deadline on something else, but I'll verify this after the weekend and commit it then

        juhanic Juhani Connolly added a comment -

        Finally got to check this properly.

        It works fine and the checkpoint is valid, but even with FLUME-1930 patched in it still hangs at the end. I'll post details there; not a big deal.

        hudson Hudson added a comment -

        Integrated in flume-trunk #376 (See https://builds.apache.org/job/flume-trunk/376/)
        FLUME-1929: CheckpointRebuilder main method updated to work for the latest Log format (Revision 082cfb498e95ae95e88d49d357839eab8ab3bf33)

        Result = ABORTED
        juhani_connolly : http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=082cfb498e95ae95e88d49d357839eab8ab3bf33
        Files :

        • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/CheckpointRebuilder.java

          People

          • Assignee:
            hshreedharan Hari Shreedharan
            Reporter:
            hshreedharan Hari Shreedharan