Flume
  1. Flume
  2. FLUME-746

Correct the behavior and logging messages about states transition of wal chunks on retry

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: v0.9.4
    • Fix Version/s: v0.9.5
    • Component/s: Node
    • Labels:

      Description

      Flume logs often have scary looking log messages that look like this:

      2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms
      2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
      2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms
      2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
      2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms
      2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

      This because previously we only expected deal with three states:

      LOGGED, SENDING, SENT.

      We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

      Current state, state to transition to.
      IMPORT -> IMPORT // new warn that this is an odd case.
      WRITING -> WRITING // new warn that this is an odd case.
      LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.
      SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry
      SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.
      E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.
      others -> others // other states are unexpected and remain in their state.

        Issue Links

          Activity

          Hide
          Jonathan Hsieh added a comment -
          Show
          Jonathan Hsieh added a comment - review here https://reviews.apache.org/r/1584/
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1584/
          -----------------------------------------------------------

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Summary
          -------

          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.
          IMPORT -> IMPORT // new warn that this is an odd case.
          WRITING -> WRITING // new warn that this is an odd case.
          LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.
          SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry
          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.
          E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.
          others -> others // other states are unexpected and remain in their state.

          This addresses bug flume-746.
          https://issues.apache.org/jira/browse/flume-746

          Diffs


          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b
          flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1584/diff

          Testing
          -------

          Tests pass but was run on top of other patches (FLUME-706, revert FLUME-656). This should be orthogonal to those changes. Full suite currently running with just this patch.

          Thanks,

          jmhsieh

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/ ----------------------------------------------------------- Review request for Flume, Arvind Prabhakar and Eric Sammer. Summary ------- Flume logs often have scary looking log messages that look like this: 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states This because previously we only expected deal with three states: LOGGED, SENDING, SENT. We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea: Current state, state to transition to. IMPORT -> IMPORT // new warn that this is an odd case. WRITING -> WRITING // new warn that this is an odd case. LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put. SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry. E2EACKED -> E2EACKED // new acked already means it is good. No need to retry. others -> others // other states are unexpected and remain in their state. This addresses bug flume-746. https://issues.apache.org/jira/browse/flume-746 Diffs flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION Diff: https://reviews.apache.org/r/1584/diff Testing ------- Tests pass but was run on top of other patches ( FLUME-706 , revert FLUME-656 ). This should be orthogonal to those changes. Full suite currently running with just this patch. Thanks, jmhsieh
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1584/#review1530
          -----------------------------------------------------------

          I have a question about how tolerant of illegal state transitions we should be. I'm in favor of failing here. Thoughts?

          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java
          <https://reviews.apache.org/r/1584/#comment3482>

          If we're doing odd state transitions, it seems like a bug and we should complain loudly (i.e. fast fast, fail big). What do you think about making these IllegalStateExceptions?

          • Eric

          On 2011-08-18 15:48:33, jmhsieh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1584/

          -----------------------------------------------------------

          (Updated 2011-08-18 15:48:33)

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Summary

          -------

          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.

          IMPORT -> IMPORT // new warn that this is an odd case.

          WRITING -> WRITING // new warn that this is an odd case.

          LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.

          SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry

          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.

          E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.

          others -> others // other states are unexpected and remain in their state.

          This addresses bug flume-746.

          https://issues.apache.org/jira/browse/flume-746

          Diffs

          -----

          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b

          flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1584/diff

          Testing

          -------

          Tests pass but was run on top of other patches (FLUME-706, revert FLUME-656). This should be orthogonal to those changes. Full suite currently running with just this patch.

          Thanks,

          jmhsieh

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/#review1530 ----------------------------------------------------------- I have a question about how tolerant of illegal state transitions we should be. I'm in favor of failing here. Thoughts? flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java < https://reviews.apache.org/r/1584/#comment3482 > If we're doing odd state transitions, it seems like a bug and we should complain loudly (i.e. fast fast, fail big). What do you think about making these IllegalStateExceptions? Eric On 2011-08-18 15:48:33, jmhsieh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/ ----------------------------------------------------------- (Updated 2011-08-18 15:48:33) Review request for Flume, Arvind Prabhakar and Eric Sammer. Summary ------- Flume logs often have scary looking log messages that look like this: 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states This because previously we only expected deal with three states: LOGGED, SENDING, SENT. We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea: Current state, state to transition to. IMPORT -> IMPORT // new warn that this is an odd case. WRITING -> WRITING // new warn that this is an odd case. LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put. SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry. E2EACKED -> E2EACKED // new acked already means it is good. No need to retry. others -> others // other states are unexpected and remain in their state. This addresses bug flume-746. https://issues.apache.org/jira/browse/flume-746 Diffs ----- flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION Diff: https://reviews.apache.org/r/1584/diff Testing ------- Tests pass but was run on top of other patches ( FLUME-706 , revert FLUME-656 ). This should be orthogonal to those changes. Full suite currently running with just this patch. Thanks, jmhsieh
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-08-18 21:05:10, Eric Sammer wrote:

          > flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java, lines 818-821

          > <https://reviews.apache.org/r/1584/diff/1/?file=33425#file33425line818>

          >

          > If we're doing odd state transitions, it seems like a bug and we should complain loudly (i.e. fast fast, fail big). What do you think about making these IllegalStateExceptions?

          I think that is a great idea. I'll change that and update tests if necessary.

          • jmhsieh

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1584/#review1530
          -----------------------------------------------------------

          On 2011-08-18 15:48:33, jmhsieh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1584/

          -----------------------------------------------------------

          (Updated 2011-08-18 15:48:33)

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Summary

          -------

          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.

          IMPORT -> IMPORT // new warn that this is an odd case.

          WRITING -> WRITING // new warn that this is an odd case.

          LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.

          SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry

          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.

          E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.

          others -> others // other states are unexpected and remain in their state.

          This addresses bug flume-746.

          https://issues.apache.org/jira/browse/flume-746

          Diffs

          -----

          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b

          flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1584/diff

          Testing

          -------

          Tests pass but was run on top of other patches (FLUME-706, revert FLUME-656). This should be orthogonal to those changes. Full suite currently running with just this patch.

          Thanks,

          jmhsieh

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-08-18 21:05:10, Eric Sammer wrote: > flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java, lines 818-821 > < https://reviews.apache.org/r/1584/diff/1/?file=33425#file33425line818 > > > If we're doing odd state transitions, it seems like a bug and we should complain loudly (i.e. fast fast, fail big). What do you think about making these IllegalStateExceptions? I think that is a great idea. I'll change that and update tests if necessary. jmhsieh ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/#review1530 ----------------------------------------------------------- On 2011-08-18 15:48:33, jmhsieh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/ ----------------------------------------------------------- (Updated 2011-08-18 15:48:33) Review request for Flume, Arvind Prabhakar and Eric Sammer. Summary ------- Flume logs often have scary looking log messages that look like this: 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states This because previously we only expected deal with three states: LOGGED, SENDING, SENT. We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea: Current state, state to transition to. IMPORT -> IMPORT // new warn that this is an odd case. WRITING -> WRITING // new warn that this is an odd case. LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put. SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry. E2EACKED -> E2EACKED // new acked already means it is good. No need to retry. others -> others // other states are unexpected and remain in their state. This addresses bug flume-746. https://issues.apache.org/jira/browse/flume-746 Diffs ----- flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION Diff: https://reviews.apache.org/r/1584/diff Testing ------- Tests pass but was run on top of other patches ( FLUME-706 , revert FLUME-656 ). This should be orthogonal to those changes. Full suite currently running with just this patch. Thanks, jmhsieh
          Hide
          Jonathan Hsieh added a comment -

          The error messages show in this FLUME-543 is likely one of the cases that after this analysis is actually legal and correct.

          Show
          Jonathan Hsieh added a comment - The error messages show in this FLUME-543 is likely one of the cases that after this analysis is actually legal and correct.
          Hide
          Jonathan Hsieh added a comment -

          Updated code/test to make bad state throw IllegalStateException.

          Show
          Jonathan Hsieh added a comment - Updated code/test to make bad state throw IllegalStateException.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1584/
          -----------------------------------------------------------

          (Updated 2011-08-19 18:43:32.589078)

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Changes
          -------

          This patch changes behavior so that it throws IllegalStateException when it makes an impossible state transition. I've also updated a test to reflect this. Finally, if something gets into ERROR state, I change the behavior so that it stays in error state. (we've already caught this error, we should not throw a new error again!)

          Summary
          -------

          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states
          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms
          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.
          IMPORT -> IMPORT // new warn that this is an odd case.
          WRITING -> WRITING // new warn that this is an odd case.
          LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.
          SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry
          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.
          E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.
          others -> others // other states are unexpected and remain in their state.

          This addresses bug flume-746.
          https://issues.apache.org/jira/browse/flume-746

          Diffs (updated)


          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b
          flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1584/diff

          Testing
          -------

          Tests pass but was run on top of other patches (FLUME-706, revert FLUME-656). This should be orthogonal to those changes. Full suite currently running with just this patch.

          Thanks,

          jmhsieh

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/ ----------------------------------------------------------- (Updated 2011-08-19 18:43:32.589078) Review request for Flume, Arvind Prabhakar and Eric Sammer. Changes ------- This patch changes behavior so that it throws IllegalStateException when it makes an impossible state transition. I've also updated a test to reflect this. Finally, if something gets into ERROR state, I change the behavior so that it stays in error state. (we've already caught this error, we should not throw a new error again!) Summary ------- Flume logs often have scary looking log messages that look like this: 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states This because previously we only expected deal with three states: LOGGED, SENDING, SENT. We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea: Current state, state to transition to. IMPORT -> IMPORT // new warn that this is an odd case. WRITING -> WRITING // new warn that this is an odd case. LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put. SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry. E2EACKED -> E2EACKED // new acked already means it is good. No need to retry. others -> others // other states are unexpected and remain in their state. This addresses bug flume-746. https://issues.apache.org/jira/browse/flume-746 Diffs (updated) flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION Diff: https://reviews.apache.org/r/1584/diff Testing ------- Tests pass but was run on top of other patches ( FLUME-706 , revert FLUME-656 ). This should be orthogonal to those changes. Full suite currently running with just this patch. Thanks, jmhsieh
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1584/#review1574
          -----------------------------------------------------------

          Ship it!

          lgtm.

          • Eric

          On 2011-08-19 18:43:32, jmhsieh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1584/

          -----------------------------------------------------------

          (Updated 2011-08-19 18:43:32)

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Summary

          -------

          Flume logs often have scary looking log messages that look like this:

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms

          2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states

          This because previously we only expected deal with three states:

          LOGGED, SENDING, SENT.

          We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea:

          Current state, state to transition to.

          IMPORT -> IMPORT // new warn that this is an odd case.

          WRITING -> WRITING // new warn that this is an odd case.

          LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put.

          SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry

          SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry.

          E2EACKED -> E2EACKED // new acked already means it is good. No need to retry.

          others -> others // other states are unexpected and remain in their state.

          This addresses bug flume-746.

          https://issues.apache.org/jira/browse/flume-746

          Diffs

          -----

          flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b

          flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1584/diff

          Testing

          -------

          Tests pass but was run on top of other patches (FLUME-706, revert FLUME-656). This should be orthogonal to those changes. Full suite currently running with just this patch.

          Thanks,

          jmhsieh

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/#review1574 ----------------------------------------------------------- Ship it! lgtm. Eric On 2011-08-19 18:43:32, jmhsieh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1584/ ----------------------------------------------------------- (Updated 2011-08-19 18:43:32) Review request for Flume, Arvind Prabhakar and Eric Sammer. Summary ------- Flume logs often have scary looking log messages that look like this: 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110801-235819417-0400.6477874242784836.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-005128611-0400.6740261462532911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states 2011-08-14 00:00:56,177 INFO com.cloudera.flume.agent.WALAckManager: Retransmitting log.00000038.20110805-031622274-0400.6748955125414911.seq after being stale for 60802ms 2011-08-14 00:00:56,177 WARN com.cloudera.flume.agent.durability.NaiveFileWALManager: There was a race that happend with SENT vs SENDING states This because previously we only expected deal with three states: LOGGED, SENDING, SENT. We actually need to deal with all possible states, and importantly, the SENDING state is a valid state to transition from.(not a race as reported). Here's the high-level idea: Current state, state to transition to. IMPORT -> IMPORT // new warn that this is an odd case. WRITING -> WRITING // new warn that this is an odd case. LOGGED -> LOGGED // This is a change, used to be considered race – This is legal – f it is log, it is slated for retry so stay put. SENDING -> SENDING // This is the change, used to be considered race – This is legal – if we are sending the chunk already, keep sending it, no need to retry SENT -> LOGGED // this was sent already but acks didn't work out. move to LOGGED state to retry. E2EACKED -> E2EACKED // new acked already means it is good. No need to retry. others -> others // other states are unexpected and remain in their state. This addresses bug flume-746. https://issues.apache.org/jira/browse/flume-746 Diffs ----- flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b flume-core/src/test/java/com/cloudera/flume/agent/TestFlumeNodeWALNotifier.java PRE-CREATION Diff: https://reviews.apache.org/r/1584/diff Testing ------- Tests pass but was run on top of other patches ( FLUME-706 , revert FLUME-656 ). This should be orthogonal to those changes. Full suite currently running with just this patch. Thanks, jmhsieh

            People

            • Assignee:
              Jonathan Hsieh
              Reporter:
              Jonathan Hsieh
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development