Flume
  1. Flume
  2. FLUME-543

TailThread exits on repeated WAL failures

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: v0.9.3
    • Fix Version/s: v0.9.5
    • Component/s: Sinks+Sources
    • Labels:
      None
    • Environment:

      Amazon EC2 Ubuntu 10.04 LTS

      Description

      On repeated failures:

      2011-02-28 23:37:26,908 ERROR com.cloudera.flume.core.connector.DirectDriver: Driving src/sink failed! LazyOpenSource | LazyOpenDecorator because java.lang.IllegalStateException: Expected state to be LOGGED but was SENT
      java.io.IOException: java.lang.IllegalStateException: Expected state to be LOGGED but was SENT
      	at com.cloudera.flume.agent.durability.NaiveFileWALDeco$1.fireError(NaiveFileWALDeco.java:254)
      	at com.cloudera.flume.core.connector.DirectDriver.fireError(DirectDriver.java:216)
      	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:102)
      Caused by: java.lang.IllegalStateException: Expected state to be LOGGED but was SENT
      	at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
      	at com.cloudera.flume.agent.durability.NaiveFileWALManager.changeState(NaiveFileWALManager.java:410)
      	at com.cloudera.flume.agent.durability.NaiveFileWALManager.getUnackedSource(NaiveFileWALManager.java:557)
      	at com.cloudera.flume.agent.durability.WALSource.next(WALSource.java:104)
      	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:89)
      

      The tailer exits and the master reports all nodes in ERROR state.

      2011-02-28 23:37:27,917 INFO com.cloudera.flume.handlers.text.TailSource: TailThread has exited
      

      Full log file attached.

      Rob

        Issue Links

          Activity

          Hide
          Disabled imported user added a comment -

          Configuration follows:

          ec2-50-16-242-85.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink;
          ec2-50-16-242-86.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink;
          ec2-50-16-242-91.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink;
          ec2-50-16-242-99.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink;
          ec2-50-16-242-126.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink;
          ec2-50-16-242-100.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink;
          ec2-50-17-224-11.compute-1.amazonaws.com : collectorSource | collectorSink("s3n://KEY:SECRET@str-flume-ec2production/%Y-%m-%d/%H00/", "nginx-");
          
          Show
          Disabled imported user added a comment - Configuration follows: ec2-50-16-242-85.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink; ec2-50-16-242-86.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink; ec2-50-16-242-91.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink; ec2-50-16-242-99.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink; ec2-50-16-242-126.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink; ec2-50-16-242-100.compute-1.amazonaws.com : tail("/opt/nginx/logs/access.log") | agentE2ESink; ec2-50-17-224-11.compute-1.amazonaws.com : collectorSource | collectorSink("s3n://KEY:SECRET@str-flume-ec2production/%Y-%m-%d/%H00/", "nginx-");
          Hide
          Disabled imported user added a comment -

          Affected agent log file attached.

          Show
          Disabled imported user added a comment - Affected agent log file attached.
          Hide
          Disabled imported user added a comment -

          I'd like to up the severity from MAJOR to CRITICAL if possible; can't figure out how to edit the ticket =/

          Show
          Disabled imported user added a comment - I'd like to up the severity from MAJOR to CRITICAL if possible; can't figure out how to edit the ticket =/
          Hide
          Jonathan Hsieh added a comment -

          I am assuming you set the collector host in the flume-site.xml file right?

          Show
          Jonathan Hsieh added a comment - I am assuming you set the collector host in the flume-site.xml file right?
          Hide
          Disabled imported user added a comment -

          Yep, here are our configs:

          Agents:

          <?xml version="1.0"?>
          <?xml-stylesheet type="text/xsl"  href="configuration.xsl"?>
          
          <configuration>
          
            <property>
              <name>flume.master.servers</name>
              <value>ec2-XXX.compute-1.amazonaws.com</value>
            </property>
          
            <property>
              <name>flume.collector.event.host</name>
              <value>ec2-XXX.compute-1.amazonaws.com</value>
            </property>
          
            <property>
              <name>flume.agent.logdir</name>
              <value>/mnt/logs/wal</value>
            </property>
          
          </configuration>
          

          Collector:

          <?xml version="1.0"?>
          <?xml-stylesheet type="text/xsl"  href="configuration.xsl"?>
          
          <configuration>
          
            <property>
              <name>flume.master.servers</name>
              <value>ec2-XXX.compute-1.amazonaws.com</value>
            </property>
          
          </configuration>
          
          Show
          Disabled imported user added a comment - Yep, here are our configs: Agents: <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>flume.master.servers</name> <value>ec2-XXX.compute-1.amazonaws.com</value> </property> <property> <name>flume.collector.event.host</name> <value>ec2-XXX.compute-1.amazonaws.com</value> </property> <property> <name>flume.agent.logdir</name> <value>/mnt/logs/wal</value> </property> </configuration> Collector: <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>flume.master.servers</name> <value>ec2-XXX.compute-1.amazonaws.com</value> </property> </configuration>
          Hide
          Jonathan Hsieh added a comment -

          Does everyone who voted for and are watching this issue using amazon ec2/s3n?

          Show
          Jonathan Hsieh added a comment - Does everyone who voted for and are watching this issue using amazon ec2/s3n?
          Hide
          Disabled imported user added a comment -

          Yes, we're on ec2.

          Show
          Disabled imported user added a comment - Yes, we're on ec2.
          Hide
          Jonathan Hsieh added a comment -

          First error in Robert's logs are in StubbornAppendSink. This code has been changed in FLUME-589.

          Show
          Jonathan Hsieh added a comment - First error in Robert's logs are in StubbornAppendSink. This code has been changed in FLUME-589 .
          Hide
          Jonathan Hsieh added a comment -

          I'm trying to figure out how to reliably duplicate this problem. It is not clear from log if nodes are being reconfigured on purpose or if it is spontaneously happening.

          Tail exiting is likely due to an exception on append that gets propagated to the driver which triggers a shutdown of the logical node's source and sink.

          There seem to be several problems and here are current theories for root cause:

          1) Improperly handled exception (this could be a runtime exception or possibly an incorrectly handled interrupted exception). It looks like in the first set of problems an IllegalStateException eventually gets thrown.
          2) Stubborn append sink may have a race condition on error recovery path.
          3) There may be a race/error in the retransmit logic path.
          4) It is possible that multiple instances of the NaiveWALManager are active simultaneously causing the state transition problems.

          Also note, that several of the mechanisms from this trace have been significantly modified by FLUME-569, FLUME-589, FLUME-595, FLUME-597. These changes will likely make the problems manifest differently.

          Show
          Jonathan Hsieh added a comment - I'm trying to figure out how to reliably duplicate this problem. It is not clear from log if nodes are being reconfigured on purpose or if it is spontaneously happening. Tail exiting is likely due to an exception on append that gets propagated to the driver which triggers a shutdown of the logical node's source and sink. There seem to be several problems and here are current theories for root cause: 1) Improperly handled exception (this could be a runtime exception or possibly an incorrectly handled interrupted exception). It looks like in the first set of problems an IllegalStateException eventually gets thrown. 2) Stubborn append sink may have a race condition on error recovery path. 3) There may be a race/error in the retransmit logic path. 4) It is possible that multiple instances of the NaiveWALManager are active simultaneously causing the state transition problems. Also note, that several of the mechanisms from this trace have been significantly modified by FLUME-569 , FLUME-589 , FLUME-595 , FLUME-597 . These changes will likely make the problems manifest differently.
          Hide
          Jonathan Hsieh added a comment -

          There are several life-cycle fixes that have gone in which I think this maybe related to. We won't be able tell until that an other patches soak for a little.

          Show
          Jonathan Hsieh added a comment - There are several life-cycle fixes that have gone in which I think this maybe related to. We won't be able tell until that an other patches soak for a little.
          Hide
          Jonathan Hsieh added a comment -

          I believe the root cause of this has something to do with resend/retry after an agent times out and improperly/racily changing the state of a log entry.

          Show
          Jonathan Hsieh added a comment - I believe the root cause of this has something to do with resend/retry after an agent times out and improperly/racily changing the state of a log entry.
          Hide
          Jonathan Hsieh added a comment -

          I believe root cause of this problem has been found and solved by FLUME-745. Essentially there was a race condition with the wal statemachine that can occur when a retry is triggered. This is also the likely the cause of much of the duplicates

          Show
          Jonathan Hsieh added a comment - I believe root cause of this problem has been found and solved by FLUME-745 . Essentially there was a race condition with the wal statemachine that can occur when a retry is triggered. This is also the likely the cause of much of the duplicates

            People

            • Assignee:
              Jonathan Hsieh
              Reporter:
              Disabled imported user
            • Votes:
              9 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development