FLUME-927

A Flume agent started before collectors in E2E mode could fail to connect to the collector

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: v0.9.4, v0.9.5
    • Fix Version/s: v0.9.5
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      The write-ahead log (WAL) mechanism expects the agent sink to become active within 1 second. If it does not, the WAL decorator assumes that the agent could not connect to the collector and shuts the sink down. The AgentSink has its own retry mechanism that handles network problems, an unavailable collector, and similar failures for a configurable amount of time. The hard-coded 1-second timeout in the WAL decorator defeats this retry mechanism.
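
The interaction described above can be modelled with a small, self-contained Java sketch. It is not Flume code: the class `WalTimeoutSketch`, the method `startAndWait`, and the constant `HARD_CODED_WAIT_MS` are hypothetical names chosen for illustration, and a sleeping thread stands in for a sink that keeps retrying until the collector comes up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the reported behaviour; none of these names come from Flume. */
final class WalTimeoutSketch {

    /** Models the 1-second wait mentioned in the description. */
    static final long HARD_CODED_WAIT_MS = 1000;

    /**
     * Starts a stand-in "subsink" that needs connectDelayMs of retrying before
     * it opens, then waits at most waitMs for it to become active.
     * Returns true if the subsink was active when the wait ended.
     */
    static boolean startAndWait(long connectDelayMs, long waitMs) throws InterruptedException {
        CountDownLatch active = new CountDownLatch(1);
        Thread subsink = new Thread(() -> {
            try {
                Thread.sleep(connectDelayMs); // stands in for retries against a late collector
                active.countDown();           // the open finally succeeded
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stopped before it could open
            }
        });
        subsink.setDaemon(true);
        subsink.start();

        boolean isActive = active.await(waitMs, TimeUnit.MILLISECONDS);
        if (!isActive) {
            subsink.interrupt(); // the waiter gives up and stops the subsink mid-retry
        }
        return isActive;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("collector already up: " + startAndWait(50, HARD_CODED_WAIT_MS));
        System.out.println("collector up late:    " + startAndWait(1500, HARD_CODED_WAIT_MS));
    }
}
```

In this model the second call gives up even though the stand-in subsink would have opened half a second later, which is the situation the issue reports for an agent started before its collector.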

      1. Flume-927.patch.1
        6 kB
        Prasad Mujumdar

        Activity

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3487/
        -----------------------------------------------------------

        Review request for Mingjie Lai and jmhsieh.

        Summary
        -------

        When the WAL decorator starts its subsink, it waits one second for the subsink to become active. If the subsink doesn't start within that interval, the decorator marks it for stop, which leaves the agent idle.
        The agent sink contains a retry sink that keeps retrying the open until it succeeds. Because the WAL decorator forces the subsink to close after one second, this retry mechanism is useless and the user is forced to restart the agent.
        The patch makes the decorator wait until the subsink is active; only an exception in the subsink aborts the wait.

        This addresses bug FLUME-927.
        https://issues.apache.org/jira/browse/FLUME-927
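
The waiting strategy summarized above might look roughly like the following sketch. It is an illustration under assumptions, not the contents of the patch: `WaitForSubsinkSketch`, `Opener`, and `openAndWait` are hypothetical names, and the polling interval is an arbitrary parameter.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of "wait until active, abort only on exception". */
final class WaitForSubsinkSketch {

    /** Stand-in for a sink open that may retry internally and may throw. */
    interface Opener {
        void open() throws Exception;
    }

    /**
     * Runs the open on its own thread and blocks until it succeeds.
     * There is no deadline; the only way out besides success is an
     * exception raised by the open itself, which is rethrown to the caller.
     */
    static void openAndWait(Opener opener, long pollMs) throws Exception {
        CountDownLatch active = new CountDownLatch(1);
        AtomicReference<Exception> failure = new AtomicReference<>();

        Thread subsink = new Thread(() -> {
            try {
                opener.open();
                active.countDown(); // open succeeded, the subsink is active
            } catch (Exception e) {
                failure.set(e);     // remember why the open failed
            }
        });
        subsink.setDaemon(true);
        subsink.start();

        // Wake up every pollMs only to check for a failure; otherwise keep waiting.
        while (!active.await(pollMs, TimeUnit.MILLISECONDS)) {
            Exception e = failure.get();
            if (e != null) {
                throw e;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        openAndWait(() -> Thread.sleep(1500), 100); // slower than one second, still succeeds
        System.out.println("subsink became active");
    }
}
```

The design choice in this sketch is that the caller never decides how long an open may take; that policy stays with whatever retry logic the opener itself implements.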

        Diffs
        -----

        flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALDeco.java 3a88ab8
        flume-core/src/main/java/com/cloudera/flume/handlers/debug/DelayDecorator.java 15a9066
        flume-core/src/test/java/com/cloudera/flume/agent/durability/TestNaiveFileWALDeco.java 8dd45fa

        Diff: https://reviews.apache.org/r/3487/diff

        Testing
        -------

        Added a new test case. Will run the full regression test suite.

        Thanks,

        Prasad

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3487/#review4385
        -----------------------------------------------------------

        Ship it!

        Nice patch, Prasad. LGTM.

        - jmhsieh

        On 2012-01-13 18:10:49, Prasad Mujumdar wrote:

        (Updated 2012-01-13 18:10:49)

        Review request for Mingjie Lai and jmhsieh.
        https://reviews.apache.org/r/3487/


          People

          • Assignee: Prasad Mujumdar
          • Reporter: Prasad Mujumdar
          • Votes: 0
          • Watchers: 0
