Flume
  1. Flume
  2. FLUME-1685

ExecSource shouldn't die if the channel is full

    Details

    • Type: Bug Bug
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: v1.2.0, v1.3.0, v1.4.0
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      Imagine this scenario. You are using the ExecSource to tail a file and send to a file channel. When the file channel fills due to a temporary issue downstream, the source gets a ChannelException which kills the source.

      2012-10-31 20:45:57,872 ERROR source.ExecSource: Failed while running command: tail -F /tmp/test.log
      org.apache.flume.ChannelException: Unable to put batch on required channel: FileChannel test { dataDirs: [/tmp/test/data] }
              at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:195)
              at org.apache.flume.source.ExecSource$ExecRunnable.run(ExecSource.java:275)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
              at java.util.concurrent.FutureTask.run(FutureTask.java:138)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
      Caused by: org.apache.flume.ChannelException: Cannot acquire capacity. [channel=hbasejson]
              at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doPut(FileChannel.java:346)
              at org.apache.flume.channel.BasicTransactionSemantics.put(BasicTransactionSemantics.java:93)
              at org.apache.flume.channel.BasicChannelSemantics.put(BasicChannelSemantics.java:76)
              at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:184)
              ... 7 more
      

      The situation where the command being 'exec'ed fails/exits is already handled with the existing retry logic.

      I suggest that when the source gets a ChannelException it throw the event away (since there is nowhere to put it) and instead sleep for second and loop again for another event. If the channel is still throwing an exception (still full), the event dropped and the sleep time doubled and we repeat again. There should be an upper bound on the retry time (say 128 seconds – about 2 minutes) for the next attempt. When the putEvent no longer throws a ChannelException, the "fallback" mode is reset and we read records at full speed again.

      Clearly in a situation where the channel is full, data loss will happen. But in this case, we wouldn't have to restart the agent. At scale this is an administrative pain. Even detecting this is difficult as the flume agent itself is still running. In this case (running a 'tail'), the tail will eventually result in data loss should the file being tailed rotate. Something has to give somewhere.

      I've got a patch I'm working on for this, but wanted to get the JIRA rolling first.

        Activity

        Hide
        Steve Hoffman added a comment - - edited

        Attached patch for trunk as well as 1.2.0 (which I use – cdh3u5)

        And how about using 'git apply' this time around so I can get some street cred for the patch

        Show
        Steve Hoffman added a comment - - edited Attached patch for trunk as well as 1.2.0 (which I use – cdh3u5) And how about using 'git apply' this time around so I can get some street cred for the patch
        Hide
        Steve Hoffman added a comment -

        Any word on the state of inclusion of this patch?

        Show
        Steve Hoffman added a comment - Any word on the state of inclusion of this patch?
        Hide
        Brock Noland added a comment -

        Hi Steve,

        I think we'd have to have a flag to turn this on/off because it would change behavior which is disallowed in a minor release. Other than that a few comments:

        1) Create a RB item fo this (reviews.apache.org) and link it here
        2) Limit lines to approximately 80 chars

        Thanks for your patch!!
        Brock

        Show
        Brock Noland added a comment - Hi Steve, I think we'd have to have a flag to turn this on/off because it would change behavior which is disallowed in a minor release. Other than that a few comments: 1) Create a RB item fo this (reviews.apache.org) and link it here 2) Limit lines to approximately 80 chars Thanks for your patch!! Brock
        Hide
        Steve Hoffman added a comment -

        I would argue that the behavior is broken. I'm assuming that the rule doesn't apply if the functionality is broken. Should this situation occur, the source stops processing data even if the channel empties and never recovers. I can't see this as desired behavior under any situation. You'd have to call the flag "stay.broken" with a default of "true"

        I can't create a review in RB. It seems to want to know which repo the patch is committed in - which it isn't (since I'm not an official committer). Thoughts?

        Show
        Steve Hoffman added a comment - I would argue that the behavior is broken. I'm assuming that the rule doesn't apply if the functionality is broken. Should this situation occur, the source stops processing data even if the channel empties and never recovers. I can't see this as desired behavior under any situation. You'd have to call the flag "stay.broken" with a default of "true" I can't create a review in RB. It seems to want to know which repo the patch is committed in - which it isn't (since I'm not an official committer). Thoughts?
        Hide
        Juhani Connolly added a comment -

        Select flume-git. The review board doesn't commit the patch, it just needs the reference point so it can give a useful view of the code for reviewers.

        The flag is for consistency of functioning. Some people may be tracking the state of the source with ganglia. Due to the fact that full channel => failed puts => rollback but execsource can't rollback, it presents the problem that if exec source keeps going there will be some events missing from the chain. While I doubt anyone relies on this functionality, it doesn't hurt to add a flag(dieOnChannelException?)

        Show
        Juhani Connolly added a comment - Select flume-git. The review board doesn't commit the patch, it just needs the reference point so it can give a useful view of the code for reviewers. The flag is for consistency of functioning. Some people may be tracking the state of the source with ganglia. Due to the fact that full channel => failed puts => rollback but execsource can't rollback, it presents the problem that if exec source keeps going there will be some events missing from the chain. While I doubt anyone relies on this functionality, it doesn't hurt to add a flag(dieOnChannelException?)
        Hide
        Steve Hoffman added a comment -

        New review created. Old one discarded.

        WRT the configurable flag, I also came across this issue: FLUME-979
        I believe this is the more common case of a user deciding if they want the restart under "failure" conditions. The difference being 979 deals with the exec'ed process dying and if flume should re-exec the command. FLUME-1685 is about flume killing the exec'ed process because of a channel append error (full, broken, etc) leaving a restart of flume (and other channels it may be dealing with as well) the only recovery option for the channel. As you said ExecSource can't roll back - but you have the data and nowhere to put it. Killing the source results in identical behavior - throwing the event away - except the channel will never recover w/o a restart (again effecting other channels). I maintain this is undesirable behavior under any situation (hence my opposition to a configuration flag on this matter).

        Here we plugged into metrics collection to alarm on when a channel fills (or better before it is full so we can maybe do something about it) rather than trying to watch the forked process. We've had many occasion where the forked process appears to be running but not actually doing anything (aka generating events). We use graphite not ganglia but the idea is the same. Like us, they are probably watching the channel depth, rate data is being read at the source end and rate at which it comes out the sink end for an indicator of health. Beyond that, it'll be specific to the use case.

        Yes, it doesn't hurt to add the flag, but the default you are suggesting seems non-obvious.

        Show
        Steve Hoffman added a comment - New review created. Old one discarded. WRT the configurable flag, I also came across this issue: FLUME-979 I believe this is the more common case of a user deciding if they want the restart under "failure" conditions. The difference being 979 deals with the exec'ed process dying and if flume should re-exec the command. FLUME-1685 is about flume killing the exec'ed process because of a channel append error (full, broken, etc) leaving a restart of flume (and other channels it may be dealing with as well) the only recovery option for the channel. As you said ExecSource can't roll back - but you have the data and nowhere to put it. Killing the source results in identical behavior - throwing the event away - except the channel will never recover w/o a restart (again effecting other channels). I maintain this is undesirable behavior under any situation (hence my opposition to a configuration flag on this matter). Here we plugged into metrics collection to alarm on when a channel fills (or better before it is full so we can maybe do something about it) rather than trying to watch the forked process. We've had many occasion where the forked process appears to be running but not actually doing anything (aka generating events). We use graphite not ganglia but the idea is the same. Like us, they are probably watching the channel depth, rate data is being read at the source end and rate at which it comes out the sink end for an indicator of health. Beyond that, it'll be specific to the use case. Yes, it doesn't hurt to add the flag, but the default you are suggesting seems non-obvious.

          People

          • Assignee:
            Unassigned
            Reporter:
            Steve Hoffman
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:

              Development