Flume
  1. Flume
  2. FLUME-883

Flume E2E sink could send incorrect ACKs if there are HDFS file close errors

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: v0.9.4
    • Fix Version/s: v0.9.5
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      The E2E collector sink saves the batch tags as the batches are passed to the downstream sinks. The ACKs are flushed when the roller close the file. Currently for the HDFS sink, the close is the only operation that guarantees that data is safely stored. Hence the acks are sent on close. If for some reason, the writes fail then we don't send the acks assuming the data is lost. The E2E mechanism then resends the data.
      The problem is that if the close fails then we don't clear the accumulated acks for that current rolltag. Hence its possible that the next successful roll could send those acks and hence the batch will not be resent.

      1. Flume-883.patch.1
        7 kB
        Prasad Mujumdar

        Activity

        Prasad Mujumdar made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Prasad Mujumdar added a comment -

        Patch committed to trunk

        Show
        Prasad Mujumdar added a comment - Patch committed to trunk
        Prasad Mujumdar made changes -
        Field Original Value New Value
        Attachment Flume-883.patch.1 [ 12507636 ]
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2011-12-16 00:33:38, Eric Sammer wrote:

        > flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java, line 193

        > <https://reviews.apache.org/r/3214/diff/2/?file=64831#file64831line193>

        >

        > Mark private?

        will do. thanks

        On 2011-12-16 00:33:38, Eric Sammer wrote:

        > flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java, line 196

        > <https://reviews.apache.org/r/3214/diff/2/?file=64831#file64831line196>

        >

        > I've mentally paged out the lock ordering of this code so I can't definitively state there's no deadlock here. I have to defer to you on this one. Just something to double (or triple) check.

        hmm .. I can't think of a deadlock case, anyway the patch has not changed the locking logic. At least I am not introducing a new deadlock

        • Prasad

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3214/#review3935
        -----------------------------------------------------------

        On 2011-12-15 21:24:54, Prasad Mujumdar wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/3214/

        -----------------------------------------------------------

        (Updated 2011-12-15 21:24:54)

        Review request for Eric Sammer.

        Summary

        -------

        The E2E collector sink saves the batch tags as the batches are passed to the downstream sinks. The ACKs are flushed when the roller close the file. Currently for the HDFS sink, the close is the only operation that guarantees that data is safely stored. Hence the acks are sent on close. If for some reason, the writes fail then we don't send the acks assuming the data is lost. The E2E mechanism then resends the data.

        The problem is that if the close fails then we don't clear the accumulated acks for that current rolltag. Hence its possible that the next successful roll could send those acks and hence the batch will not be resent.

        The fix is to clear the unsent acks when there's an IOException in close. Also added a config property to disable the behavior for sinks where different close semantics apply.

        This addresses bug FLUME-883.

        https://issues.apache.org/jira/browse/FLUME-883

        Diffs

        -----

        flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java 20f60c6

        flume-core/src/main/java/com/cloudera/flume/conf/FlumeConfiguration.java aeceb15

        flume-core/src/test/java/com/cloudera/flume/collector/TestCollectorSink.java e735f38

        Diff: https://reviews.apache.org/r/3214/diff

        Testing

        -------

        Added new test case.

        Ran CollectorSink tests, will run rest of the regression tests.

        Thanks,

        Prasad

        Show
        jiraposter@reviews.apache.org added a comment - On 2011-12-16 00:33:38, Eric Sammer wrote: > flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java, line 193 > < https://reviews.apache.org/r/3214/diff/2/?file=64831#file64831line193 > > > Mark private? will do. thanks On 2011-12-16 00:33:38, Eric Sammer wrote: > flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java, line 196 > < https://reviews.apache.org/r/3214/diff/2/?file=64831#file64831line196 > > > I've mentally paged out the lock ordering of this code so I can't definitively state there's no deadlock here. I have to defer to you on this one. Just something to double (or triple) check. hmm .. I can't think of a deadlock case, anyway the patch has not changed the locking logic. At least I am not introducing a new deadlock Prasad ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3214/#review3935 ----------------------------------------------------------- On 2011-12-15 21:24:54, Prasad Mujumdar wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3214/ ----------------------------------------------------------- (Updated 2011-12-15 21:24:54) Review request for Eric Sammer. Summary ------- The E2E collector sink saves the batch tags as the batches are passed to the downstream sinks. The ACKs are flushed when the roller close the file. Currently for the HDFS sink, the close is the only operation that guarantees that data is safely stored. Hence the acks are sent on close. If for some reason, the writes fail then we don't send the acks assuming the data is lost. The E2E mechanism then resends the data. The problem is that if the close fails then we don't clear the accumulated acks for that current rolltag. Hence its possible that the next successful roll could send those acks and hence the batch will not be resent. The fix is to clear the unsent acks when there's an IOException in close. Also added a config property to disable the behavior for sinks where different close semantics apply. This addresses bug FLUME-883 . https://issues.apache.org/jira/browse/FLUME-883 Diffs ----- flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java 20f60c6 flume-core/src/main/java/com/cloudera/flume/conf/FlumeConfiguration.java aeceb15 flume-core/src/test/java/com/cloudera/flume/collector/TestCollectorSink.java e735f38 Diff: https://reviews.apache.org/r/3214/diff Testing ------- Added new test case. Ran CollectorSink tests, will run rest of the regression tests. Thanks, Prasad
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3214/#review3935
        -----------------------------------------------------------

        Ship it!

        Looks right to me, save for my lack of memory of lock acquisition ordering.

        flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java
        <https://reviews.apache.org/r/3214/#comment8875>

        Super nit: if (cleanupOnClose) is the same as if (cleanupOnClose == true).

        flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java
        <https://reviews.apache.org/r/3214/#comment8877>

        Mark private?

        flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java
        <https://reviews.apache.org/r/3214/#comment8878>

        I've mentally paged out the lock ordering of this code so I can't definitively state there's no deadlock here. I have to defer to you on this one. Just something to double (or triple) check.

        flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java
        <https://reviews.apache.org/r/3214/#comment8876>

        Same super nit on if (... == true).

        • Eric

        On 2011-12-15 21:24:54, Prasad Mujumdar wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/3214/

        -----------------------------------------------------------

        (Updated 2011-12-15 21:24:54)

        Review request for Eric Sammer.

        Summary

        -------

        The E2E collector sink saves the batch tags as the batches are passed to the downstream sinks. The ACKs are flushed when the roller close the file. Currently for the HDFS sink, the close is the only operation that guarantees that data is safely stored. Hence the acks are sent on close. If for some reason, the writes fail then we don't send the acks assuming the data is lost. The E2E mechanism then resends the data.

        The problem is that if the close fails then we don't clear the accumulated acks for that current rolltag. Hence its possible that the next successful roll could send those acks and hence the batch will not be resent.

        The fix is to clear the unsent acks when there's an IOException in close. Also added a config property to disable the behavior for sinks where different close semantics apply.

        This addresses bug FLUME-883.

        https://issues.apache.org/jira/browse/FLUME-883

        Diffs

        -----

        flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java 20f60c6

        flume-core/src/main/java/com/cloudera/flume/conf/FlumeConfiguration.java aeceb15

        flume-core/src/test/java/com/cloudera/flume/collector/TestCollectorSink.java e735f38

        Diff: https://reviews.apache.org/r/3214/diff

        Testing

        -------

        Added new test case.

        Ran CollectorSink tests, will run rest of the regression tests.

        Thanks,

        Prasad

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3214/#review3935 ----------------------------------------------------------- Ship it! Looks right to me, save for my lack of memory of lock acquisition ordering. flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java < https://reviews.apache.org/r/3214/#comment8875 > Super nit: if (cleanupOnClose) is the same as if (cleanupOnClose == true). flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java < https://reviews.apache.org/r/3214/#comment8877 > Mark private? flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java < https://reviews.apache.org/r/3214/#comment8878 > I've mentally paged out the lock ordering of this code so I can't definitively state there's no deadlock here. I have to defer to you on this one. Just something to double (or triple) check. flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java < https://reviews.apache.org/r/3214/#comment8876 > Same super nit on if (... == true). Eric On 2011-12-15 21:24:54, Prasad Mujumdar wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3214/ ----------------------------------------------------------- (Updated 2011-12-15 21:24:54) Review request for Eric Sammer. Summary ------- The E2E collector sink saves the batch tags as the batches are passed to the downstream sinks. The ACKs are flushed when the roller close the file. Currently for the HDFS sink, the close is the only operation that guarantees that data is safely stored. Hence the acks are sent on close. If for some reason, the writes fail then we don't send the acks assuming the data is lost. The E2E mechanism then resends the data. The problem is that if the close fails then we don't clear the accumulated acks for that current rolltag. Hence its possible that the next successful roll could send those acks and hence the batch will not be resent. The fix is to clear the unsent acks when there's an IOException in close. Also added a config property to disable the behavior for sinks where different close semantics apply. This addresses bug FLUME-883 . https://issues.apache.org/jira/browse/FLUME-883 Diffs ----- flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java 20f60c6 flume-core/src/main/java/com/cloudera/flume/conf/FlumeConfiguration.java aeceb15 flume-core/src/test/java/com/cloudera/flume/collector/TestCollectorSink.java e735f38 Diff: https://reviews.apache.org/r/3214/diff Testing ------- Added new test case. Ran CollectorSink tests, will run rest of the regression tests. Thanks, Prasad
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3214/
        -----------------------------------------------------------

        Review request for Eric Sammer.

        Summary
        -------

        The E2E collector sink saves the batch tags as the batches are passed to the downstream sinks. The ACKs are flushed when the roller close the file. Currently for the HDFS sink, the close is the only operation that guarantees that data is safely stored. Hence the acks are sent on close. If for some reason, the writes fail then we don't send the acks assuming the data is lost. The E2E mechanism then resends the data.
        The problem is that if the close fails then we don't clear the accumulated acks for that current rolltag. Hence its possible that the next successful roll could send those acks and hence the batch will not be resent.

        The fix is to clear the unsent acks when there's an IOException in close. Also added a config property to disable the behavior for sinks where different close semantics apply.

        This addresses bug FLUME-883.
        https://issues.apache.org/jira/browse/FLUME-883

        Diffs


        flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java 20f60c6
        flume-core/src/main/java/com/cloudera/flume/conf/FlumeConfiguration.java aeceb15
        flume-core/src/test/java/com/cloudera/flume/collector/TestCollectorSink.java e735f38

        Diff: https://reviews.apache.org/r/3214/diff

        Testing
        -------

        Added new test case.
        Ran CollectorSink tests, will run rest of the regression tests.

        Thanks,

        Prasad

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3214/ ----------------------------------------------------------- Review request for Eric Sammer. Summary ------- The E2E collector sink saves the batch tags as the batches are passed to the downstream sinks. The ACKs are flushed when the roller close the file. Currently for the HDFS sink, the close is the only operation that guarantees that data is safely stored. Hence the acks are sent on close. If for some reason, the writes fail then we don't send the acks assuming the data is lost. The E2E mechanism then resends the data. The problem is that if the close fails then we don't clear the accumulated acks for that current rolltag. Hence its possible that the next successful roll could send those acks and hence the batch will not be resent. The fix is to clear the unsent acks when there's an IOException in close. Also added a config property to disable the behavior for sinks where different close semantics apply. This addresses bug FLUME-883 . https://issues.apache.org/jira/browse/FLUME-883 Diffs flume-core/src/main/java/com/cloudera/flume/collector/CollectorSink.java 20f60c6 flume-core/src/main/java/com/cloudera/flume/conf/FlumeConfiguration.java aeceb15 flume-core/src/test/java/com/cloudera/flume/collector/TestCollectorSink.java e735f38 Diff: https://reviews.apache.org/r/3214/diff Testing ------- Added new test case. Ran CollectorSink tests, will run rest of the regression tests. Thanks, Prasad
        Prasad Mujumdar created issue -

          People

          • Assignee:
            Prasad Mujumdar
            Reporter:
            Prasad Mujumdar
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development