Flume / FLUME-229

Flume collector not recovering properly if the hdfs namenode goes down.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: v0.9.0, v0.9.1
    • Fix Version/s: v0.9.1u1, v0.9.2
    • Component/s: Node
    • Labels:
      None

      Description

      Via Eric van Dewoestine
      """
      I'm just getting started with flume and attempting to test the fault
      tolerance of the system. We are initially logging to flat files, but
      would like to move to logging straight into hdfs. What I'm attempting
      to test is the case where:

      1. a flume collector receives a message to be logged to hdfs
      2. flume cannot connect to hdfs, long enough to exhaust the 10
      attempts by the hadoop ipc.Client resulting in a
      ConnectionException being logged.
      3. bring hdfs up

      After bringing hdfs up I'd like to see flume auto connect at some
      point so that messages can then be written there, but it appears that
      once that ConnectionException is raised, the only way to get flume to
      connect to hdfs is a restart.

      """

        Issue Links

        This issue relates to FLUME-249

          Activity

          Jonathan Hsieh created issue -
          Jonathan Hsieh added a comment -

          This is a case Flume should definitely recover from, and from what you are saying, it is not properly handled. This is likely a bug inside the collectorSink. Could you file an issue and mark it as a blocker? Please include your description (and maybe stack traces too).

          Here's a quick explanation of how collectorSink works. Internally, the collector sink looks roughly like this:

          ackChecksum => roll(xxx) { ackForwardOnRoll => escapedCustomDfs(xxx) }

          ackChecksum looks for and checks acked groups of events, and registers received acks in a table.
          roll periodically opens and closes new instances of its sub-sink.
          ackForwardOnRoll closes its sub-sink and then pushes acks to the master (and on to the agent).
          escapedCustomDfs writes events into hdfs/file/s3 using the output bucket escaping mechanism.

          Data from all modes can be sent through this pipeline, and if it is in e2e mode, acks are only sent when the dfs sink has been closed (and thus flushed).

          My guess is that our test cases may not cover the case where the connection attempt times out and fails completely. (I think the roll would eventually close and attempt to reopen.)
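
          As a rough illustration of the layering above (a minimal sketch, not the actual Flume RollSink/EventSink classes; all names here are hypothetical), the roll decorator behaves roughly like the Java below. The point relevant to this bug is that a failed open of the sub-sink has to be retried on a later roll or append, instead of leaving the collector permanently wedged:

          // Minimal sketch of a roll-style decorator. Hypothetical names; not Flume's code.
          interface EventSink {
            void open() throws java.io.IOException;
            void append(String event) throws java.io.IOException;
            void close() throws java.io.IOException;
          }

          class RollSinkSketch implements EventSink {
            private final java.util.function.Supplier<EventSink> subSinkFactory;
            private final long rollMillis;
            private EventSink current;   // null when the sub-sink is not open
            private long lastRoll;

            RollSinkSketch(java.util.function.Supplier<EventSink> factory, long rollMillis) {
              this.subSinkFactory = factory;
              this.rollMillis = rollMillis;
            }

            @Override public void open() throws java.io.IOException {
              tryOpenSubSink();
              lastRoll = System.currentTimeMillis();
            }

            @Override public void append(String event) throws java.io.IOException {
              if (System.currentTimeMillis() - lastRoll >= rollMillis) {
                rotate();               // close the old sub-sink, open a fresh one
              }
              if (current == null) {
                tryOpenSubSink();       // retry after an earlier failed open (e.g. namenode was down)
              }
              current.append(event);
            }

            @Override public void close() throws java.io.IOException {
              if (current != null) { current.close(); current = null; }
            }

            private void rotate() {
              try { close(); } catch (java.io.IOException ignored) { /* best effort */ }
              try { tryOpenSubSink(); } catch (java.io.IOException e) {
                // Leave current == null; the next append() retries the open,
                // so a transient hdfs outage does not kill the sink for good.
              }
              lastRoll = System.currentTimeMillis();
            }

            private void tryOpenSubSink() throws java.io.IOException {
              EventSink s = subSinkFactory.get();
              s.open();
              current = s;
            }
          }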

          Jonathan Hsieh made changes -
          Field Original Value New Value
          Assignee Jonathan Hsieh [ jmhsieh ]
          Jonathan Hsieh made changes -
          Component/s Node [ 10036 ]
          Jonathan Hsieh added a comment -

          I now have a manually tested fix, but I need to make sure the update handles the node life cycle properly, and to add automated unit tests or system-level integration tests.

          Jonathan Hsieh added a comment - edited

          Procedure for manually testing recovery:

          In one terminal:
          $ bin/flume node_nowatch -1 -n foo -c 'foo: asciisynth(0)| {delay(100)=>collectorSink("hdfs://localhost/user/jon/test/","%{rolltag}") };'

          In another terminal:
          $ sudo service hadoop-0.20-namenode stop
          Wait for the retry error messages to go by (eventually they will be repeated again).

          $ sudo service hadoop-0.20-namenode start

          There will be about 30 seconds during which the namenode does not allow writes because it is in safe mode.

          Data starts appearing again.

          Jonathan Hsieh added a comment - edited

          Procedure for testing reconfiguration when the namenode is down:

          exec multiconfig 'host:asciisynth(0)| {delay(100)=>collectorSink("hdfs://localhost/user/jon/test","%{rolltag}") };'

          See data appear in the specified directory.

          Kill the namenode:
          sudo service hadoop-0.20-namenode stop

          exec config host null null

          (configuration should eventually get updated, likely within 30-60 seconds)

          Jonathan Hsieh added a comment - edited

          I believe the patch works, but sometimes a LeaseChecker thread remains alive (this is part of HDFS's DFSClient class). It seems benign in the short term, but it could be related to the file handle problem in FLUME-234.
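
          For what it's worth, a test could check for the leftover thread along these lines (a sketch; the "LeaseChecker" thread-name match is an assumption and may differ across Hadoop versions):

          // Hypothetical test helper: scan live threads for DFSClient's lease-renewal thread.
          static boolean leaseCheckerThreadAlive() {
            for (Thread t : Thread.getAllStackTraces().keySet()) {
              if (t.isAlive() && t.getName().contains("LeaseChecker")) {
                return true;
              }
            }
            return false;
          }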

          Jonathan Hsieh made changes -
          Status Open [ 1 ] Patch Available [ 10000 ]
          Jonathan Hsieh added a comment -

          Up for review: https://review.cloudera.org/r/896/
          Jonathan Hsieh made changes -
          Jonathan Hsieh added a comment - edited

          Committed; has follow-on issues.

          Jonathan Hsieh made changes -
          Status Patch Available [ 10000 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Jonathan Hsieh made changes -
          Link This issue relates to FLUME-249 [ FLUME-249 ]
          Jonathan Hsieh made changes -
          Fix Version/s v0.9.1u1 [ 10030 ]
          Jonathan Hsieh added a comment -

          Closing released issues.

          Jonathan Hsieh made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Mark Thomas made changes -
          Project Import Tue Aug 02 16:57:12 UTC 2011 [ 1312304232406 ]

            People

            • Assignee: Jonathan Hsieh
            • Reporter: Jonathan Hsieh
            • Votes: 0
            • Watchers: 1
