Uploaded image for project: 'Flume'
  1. Flume
  2. FLUME-553

collectorSink becomes unstable when hdfs goes down using e2e (but it "gets fixed" when restarting the collector)

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 0.9.3
    • Fix Version/s: 0.9.5
    • Component/s: Sinks+Sources
    • Labels:

      Description

      I've found a case when a collector using collectorSink (in e2e mode)
      becomes unstable if hdfs enters safeMode (and leaves it after a
      while).
      My collectorSink looks like:

      exec config collector02.data data-flow autoCollectorSource
      'collectorSink("hdfs://host.namel/user/flume/input", "data-",
      30000)'

      The situation would be:
      1- The collector is saving data to hdfs properly .
      2- I turn on safe mode on hdfs.
      3- Collector starts failing, this is normal as can't write to hdfs.
      4- After the collector has been failing trying to create 2 files (more
      than 30000 mili * 2) I make hdfs leave the safe mode.
      5- Since now, even hdfs is not in safemode anymore, I will sometimes
      be getting errors like:

      NFO endtoend.AckChecksumChecker: Starting checksum group called log.
      00000033.20110309-172500877+0100.1299687900877700000.seq
      2011-03-09 17:26:12,415 [logicalNode collector02.data-25] INFO
      endtoend.AckChecksumChecker: initial checksum is 12e9b6f86ce
      2011-03-09 17:26:12,415 [logicalNode collector02.data-25] INFO
      hdfs.EscapedCustomDfsSink: Closing
      hdfs://sturlese.local/user/flume/input/data-log.00000025.20110309-172540945 +0100.1299687940945069000.seq
      2011-03-09 17:26:12,415 [logicalNode collector02.data-25] INFO
      hdfs.CustomDfsSink: Closing HDFS file:
      hdfs://sturlese.local/user/flume/input/data-log.00000025.20110309-172540945 +0100.1299687940945069000.seq
      2011-03-09 17:26:12,415 [logicalNode collector02.data-25] INFO
      hdfs.CustomDfsSink: done writing raw file to hdfs
      2011-03-09 17:26:12,435 [logicalNode collector02.data-25] INFO
      hdfs.EscapedCustomDfsSink: Closing
      hdfs://sturlese.local/user/flume/input/data-log.00000044.20110309-172430963 +0100.1299687870963688000.seq
      2011-03-09 17:26:12,435 [logicalNode collector02.data-25] INFO
      hdfs.CustomDfsSink: Closing HDFS file:
      hdfs://sturlese.local/user/flume/input/data-log.00000044.20110309-172430963 +0100.1299687870963688000.seq
      2011-03-09 17:26:12,435 [logicalNode collector02.data-25] INFO
      hdfs.CustomDfsSink: done writing raw file to hdfs
      2011-03-09 17:26:12,435 [logicalNode collector02.data-25] ERROR
      rolling.RollSink: Failure when attempting to rotate and open new sink:
      org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot add
      block to /user/flume/input/data-log.
      00000044.20110309-172430963+0100.1299687870963688000.seq. Name node is
      in safe mode.
      Use "hadoop dfsadmin -safemode leave" to turn safe mode off.
      at
      org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNa mesystem.java:
      1318)
      at
      org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:
      469)
      at sun.reflect.Generated
      MethodAccessor11.invoke(Unknown Source)
      at
      sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImp l.java:
      25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
      2011-03-09 17:26:12,435 [logicalNode collector02.data-25] INFO
      hdfs.EscapedCustomDfsSink: Opening
      hdfs://host.name/user/flume/input/data-log.00000025.20110309-172612415+0100 .1299687972415285000.seq

      If I refresh the collector using the shell (exec refresh
      collector02.data) the error will disappear and all data that couldn't
      be written will suddenly be (because of the e2e).

      Reproducing the situation many times from a
      fresh installation every time, some of them it gets recovered without
      the need of refreshing, but most of them not.

      Jonathan Hsieh tip:
      At the highest level, it seems like there is some bad state sticking
      around that doesn't get cleaned out properly when recovering.
      Internally the escaping code ends up holding many hdfs file handles
      open. My guess is that we are in a situation with multiple handles
      open and the error occurs. We properly handle one error but the other
      broken file handles (based on connections that were initiated in safe
      mode) are not cleaned up. The bad handles eventually get closed when
      a roll tries to flush them, and this is when the actual error message
      appears.
      The refresh closes everything down and the restarts it, so all the new
      connections are all good.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              flume_sturlese Disabled imported user
            • Votes:
              2 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: