Flume
  1. Flume
  2. FLUME-746

Correct the behavior and logging messages about states transition of wal chunks on retry

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: v0.9.4
    • Fix Version/s: v0.9.5
    • Component/s: Node
    • Labels:

      Description

      Flume logs often have scary looking log messages that look like this:

      2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms
      2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
      2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms
      2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
      2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms
      2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

      This because previously we only expected deal with three states:

      LOGGED, SENDING, SENT.

      We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

      Current state, state to transition to.
      IMPORT -> IMPORT // new warn that this is an odd case.
      WRITING -> WRITING // new warn that this is an odd case.
      LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.
      SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry
      SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.
      E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.
      others -> others // other states are unexpected and remain in their state.

        Issue Links

          Activity

          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Patch Available Patch Available
          19m 28s 1 Jonathan Hsieh 18/Aug/11 16:49
          Patch Available Patch Available Resolved Resolved
          1d 9h 6m 1 Jonathan Hsieh 20/Aug/11 01:56
          Gavin made changes -
          Link This issue is depended upon by FLUME-745 [ FLUME-745 ]
          Gavin made changes -
          Link This issue blocks FLUME-745 [ FLUME-745 ]
          Jonathan Hsieh made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Jonathan Hsieh made changes -
          Link This issue blocks FLUME-745 [ FLUME-745 ]
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1584/#review1574
          -----------------------------------------------------------

          Ship it!

          lgtm.

          • Eric

          On 2011-08-19 18:43:32, jmhsieh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1584/

          -----------------------------------------------------------

          (Updated 2011-08-19 18:43:32)

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Summary

          -------

          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.

          IMPORT -> IMPORT // new warn that this is an odd case.

          WRITING -> WRITING // new warn that this is an odd case.

          LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.

          SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry

          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.

          E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.

          others -> others // other states are unexpected and remain in their state.

          This addresses bug flume-746.

          https://issues.apache.org/jira/browse/flume-746

          Diffs

          -----

          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b

          flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1584/diff

          Testing

          -------

          Tests pass but was run on top of other patches (FLUME-706, revert FLUME-656). This should be orthogonal to those changes. Full suite currently running with just this patch.

          Thanks,

          jmhsieh

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/#review1574 ----------------------------------------------------------- Ship it! lgtm. Eric On 2011-08-19 18:43:32, jmhsieh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/ ----------------------------------------------------------- (Updated 2011-08-19 18:43:32) Review request for Flume, Arvind Prabhakar and Eric Sammer. Summary ------- Flume logs often have scary looking log messages that look like this: 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states This because previously we only expected deal with three states: LOGGED, SENDING, SENT. We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea: Current state, state to transition to. IMPORT -> IMPORT // new warn that this is an odd case. WRITING -> WRITING // new warn that this is an odd case. LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put. SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry. E2EACKED -> E2EACKED // new acked already means it is good. No need to retry. others -> others // other states are unexpected and remain in their state. This addresses bug flume-746. https://issues.apache.org/jira/browse/flume-746 Diffs ----- flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION Diff: https://reviews.apache.org/r/1584/diff Testing ------- Tests pass but was run on top of other patches ( FLUME-706 , revert FLUME-656 ). This should be orthogonal to those changes. Full suite currently running with just this patch. Thanks, jmhsieh
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1584/
          -----------------------------------------------------------

          (Updated 2011-08-19 18:43:32.589078)

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Changes
          -------

          This patch changes behavior so that it throws IllegalStateException when it makes an impossible state transition. I've also updated a test to reflect this. Finally, if something gets into ERROR state, I change the behavior so that it stays in error state. (we've already caught this error, we should not throw a new error again!)

          Summary
          -------

          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.
          IMPORT -> IMPORT // new warn that this is an odd case.
          WRITING -> WRITING // new warn that this is an odd case.
          LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.
          SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry
          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.
          E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.
          others -> others // other states are unexpected and remain in their state.

          This addresses bug flume-746.
          https://issues.apache.org/jira/browse/flume-746

          Diffs (updated)


          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b
          flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1584/diff

          Testing
          -------

          Tests pass but was run on top of other patches (FLUME-706, revert FLUME-656). This should be orthogonal to those changes. Full suite currently running with just this patch.

          Thanks,

          jmhsieh

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/ ----------------------------------------------------------- (Updated 2011-08-19 18:43:32.589078) Review request for Flume, Arvind Prabhakar and Eric Sammer. Changes ------- This patch changes behavior so that it throws IllegalStateException when it makes an impossible state transition. I've also updated a test to reflect this. Finally, if something gets into ERROR state, I change the behavior so that it stays in error state. (we've already caught this error, we should not throw a new error again!) Summary ------- Flume logs often have scary looking log messages that look like this: 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states This because previously we only expected deal with three states: LOGGED, SENDING, SENT. We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea: Current state, state to transition to. IMPORT -> IMPORT // new warn that this is an odd case. WRITING -> WRITING // new warn that this is an odd case. LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put. SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry. E2EACKED -> E2EACKED // new acked already means it is good. No need to retry. others -> others // other states are unexpected and remain in their state. This addresses bug flume-746. https://issues.apache.org/jira/browse/flume-746 Diffs (updated) flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION Diff: https://reviews.apache.org/r/1584/diff Testing ------- Tests pass but was run on top of other patches ( FLUME-706 , revert FLUME-656 ). This should be orthogonal to those changes. Full suite currently running with just this patch. Thanks, jmhsieh
          Jonathan Hsieh made changes -
          Hide
          Jonathan Hsieh added a comment -

          Updated code/test to make bad state throw IllegalStateException.

          Show
          Jonathan Hsieh added a comment - Updated code/test to make bad state throw IllegalStateException.
          Hide
          Jonathan Hsieh added a comment -

          The error messages show in this FLUME-543 is likely one of the cases that after this analysis is actually legal and correct.

          Show
          Jonathan Hsieh added a comment - The error messages show in this FLUME-543 is likely one of the cases that after this analysis is actually legal and correct.
          Jonathan Hsieh made changes -
          Link This issue blocks FLUME-543 [ FLUME-543 ]
          Jonathan Hsieh made changes -
          Link This issue blocks FLUME-745 [ FLUME-745 ]
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-08-18 21:05:10, Eric Sammer wrote:

          > flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java, lines 818-821

          > <https://reviews.apache.org/r/1584/diff/1/?file=33425#file33425line818>

          >

          > If we're doing odd state transitions, it seems like a bug and we should complain loudly (i.e. fast fast, fail big). What do you think about making these IllegalStateExceptions?

          I think that is a great idea. I'll change that and update tests if necessary.

          • jmhsieh

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1584/#review1530
          -----------------------------------------------------------

          On 2011-08-18 15:48:33, jmhsieh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1584/

          -----------------------------------------------------------

          (Updated 2011-08-18 15:48:33)

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Summary

          -------

          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.

          IMPORT -> IMPORT // new warn that this is an odd case.

          WRITING -> WRITING // new warn that this is an odd case.

          LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.

          SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry

          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.

          E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.

          others -> others // other states are unexpected and remain in their state.

          This addresses bug flume-746.

          https://issues.apache.org/jira/browse/flume-746

          Diffs

          -----

          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b

          flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1584/diff

          Testing

          -------

          Tests pass but was run on top of other patches (FLUME-706, revert FLUME-656). This should be orthogonal to those changes. Full suite currently running with just this patch.

          Thanks,

          jmhsieh

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-08-18 21:05:10, Eric Sammer wrote: > flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java, lines 818-821 > < https://reviews.apache.org/r/1584/diff/1/?file=33425#file33425line818 > > > If we're doing odd state transitions, it seems like a bug and we should complain loudly (i.e. fast fast, fail big). What do you think about making these IllegalStateExceptions? I think that is a great idea. I'll change that and update tests if necessary. jmhsieh ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/#review1530 ----------------------------------------------------------- On 2011-08-18 15:48:33, jmhsieh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/ ----------------------------------------------------------- (Updated 2011-08-18 15:48:33) Review request for Flume, Arvind Prabhakar and Eric Sammer. Summary ------- Flume logs often have scary looking log messages that look like this: 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states This because previously we only expected deal with three states: LOGGED, SENDING, SENT. We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea: Current state, state to transition to. IMPORT -> IMPORT // new warn that this is an odd case. WRITING -> WRITING // new warn that this is an odd case. LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put. SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry. E2EACKED -> E2EACKED // new acked already means it is good. No need to retry. others -> others // other states are unexpected and remain in their state. This addresses bug flume-746. https://issues.apache.org/jira/browse/flume-746 Diffs ----- flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION Diff: https://reviews.apache.org/r/1584/diff Testing ------- Tests pass but was run on top of other patches ( FLUME-706 , revert FLUME-656 ). This should be orthogonal to those changes. Full suite currently running with just this patch. Thanks, jmhsieh
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1584/#review1530
          -----------------------------------------------------------

          I have a question about how tolerant of illegal state transitions we should be. I'm in favor of failing here. Thoughts?

          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java
          <https://reviews.apache.org/r/1584/#comment3482>

          If we're doing odd state transitions, it seems like a bug and we should complain loudly (i.e. fast fast, fail big). What do you think about making these IllegalStateExceptions?

          • Eric

          On 2011-08-18 15:48:33, jmhsieh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1584/

          -----------------------------------------------------------

          (Updated 2011-08-18 15:48:33)

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Summary

          -------

          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.

          IMPORT -> IMPORT // new warn that this is an odd case.

          WRITING -> WRITING // new warn that this is an odd case.

          LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.

          SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry

          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.

          E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.

          others -> others // other states are unexpected and remain in their state.

          This addresses bug flume-746.

          https://issues.apache.org/jira/browse/flume-746

          Diffs

          -----

          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b

          flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1584/diff

          Testing

          -------

          Tests pass but was run on top of other patches (FLUME-706, revert FLUME-656). This should be orthogonal to those changes. Full suite currently running with just this patch.

          Thanks,

          jmhsieh

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/#review1530 ----------------------------------------------------------- I have a question about how tolerant of illegal state transitions we should be. I'm in favor of failing here. Thoughts? flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java < https://reviews.apache.org/r/1584/#comment3482 > If we're doing odd state transitions, it seems like a bug and we should complain loudly (i.e. fast fast, fail big). What do you think about making these IllegalStateExceptions? Eric On 2011-08-18 15:48:33, jmhsieh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/ ----------------------------------------------------------- (Updated 2011-08-18 15:48:33) Review request for Flume, Arvind Prabhakar and Eric Sammer. Summary ------- Flume logs often have scary looking log messages that look like this: 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states This because previously we only expected deal with three states: LOGGED, SENDING, SENT. We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea: Current state, state to transition to. IMPORT -> IMPORT // new warn that this is an odd case. WRITING -> WRITING // new warn that this is an odd case. LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put. SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry. E2EACKED -> E2EACKED // new acked already means it is good. No need to retry. others -> others // other states are unexpected and remain in their state. This addresses bug flume-746. https://issues.apache.org/jira/browse/flume-746 Diffs ----- flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION Diff: https://reviews.apache.org/r/1584/diff Testing ------- Tests pass but was run on top of other patches ( FLUME-706 , revert FLUME-656 ). This should be orthogonal to those changes. Full suite currently running with just this patch. Thanks, jmhsieh
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1584/
          -----------------------------------------------------------

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Summary
          -------

          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.
          IMPORT -> IMPORT // new warn that this is an odd case.
          WRITING -> WRITING // new warn that this is an odd case.
          LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.
          SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry
          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.
          E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.
          others -> others // other states are unexpected and remain in their state.

          This addresses bug flume-746.
          https://issues.apache.org/jira/browse/flume-746

          Diffs


          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b
          flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1584/diff

          Testing
          -------

          Tests pass but was run on top of other patches (FLUME-706, revert FLUME-656). This should be orthogonal to those changes. Full suite currently running with just this patch.

          Thanks,

          jmhsieh

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/ ----------------------------------------------------------- Review request for Flume, Arvind Prabhakar and Eric Sammer. Summary ------- Flume logs often have scary looking log messages that look like this: 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states This because previously we only expected deal with three states: LOGGED, SENDING, SENT. We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea: Current state, state to transition to. IMPORT -> IMPORT // new warn that this is an odd case. WRITING -> WRITING // new warn that this is an odd case. LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put. SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry. E2EACKED -> E2EACKED // new acked already means it is good. No need to retry. others -> others // other states are unexpected and remain in their state. This addresses bug flume-746. https://issues.apache.org/jira/browse/flume-746 Diffs flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION Diff: https://reviews.apache.org/r/1584/diff Testing ------- Tests pass but was run on top of other patches ( FLUME-706 , revert FLUME-656 ). This should be orthogonal to those changes. Full suite currently running with just this patch. Thanks, jmhsieh
          Jonathan Hsieh made changes -
          Jonathan Hsieh made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Assignee Jonathan Hsieh [ jmhsieh ]
          Hide
          Jonathan Hsieh added a comment -
          Show
          Jonathan Hsieh added a comment - review here https://reviews.apache.org/r/1584/
          Jonathan Hsieh made changes -
          Field Original Value New Value
          Description Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.
          IMPORT -> IMPORT // *new* warn that this is an odd case.
          WRITING -> WRITING // *new* warn that this is an odd case
          LOGGED -> LOGGED // if it is log, it is slated for retry so stay put
          SENDING -> SENDING // *This is the change* -- This is legal -- if we are sending the chunk already, keep sending it, no need to retry
          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.
          E2EACKED -> E2EACKED // *new* acked already means it is good. No need to retry.
          others -> others // other states are unexpected and remain in their state.




          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.
          IMPORT -> IMPORT // *new* warn that this is an odd case.
          WRITING -> WRITING // *new* warn that this is an odd case.
          LOGGED -> LOGGED // *This is a change, used to be considered race* -- This is legal -- f it is log, it is slated for retry so stay put.
          SENDING -> SENDING // *This is the change, used to be considered race* -- This is legal -- if we are sending the chunk already, keep sending it, no need to retry
          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.
          E2EACKED -> E2EACKED // *new* acked already means it is good. No need to retry.
          others -> others // other states are unexpected and remain in their state.




          Jonathan Hsieh created issue -

            People

            • Assignee:
              Jonathan Hsieh
              Reporter:
              Jonathan Hsieh
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development