Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      There is a possibility of losing data with the current HDFS sequence file writer.

      Internally, the `SequenceFile.Writer` buffers data and periodically syncs it to the underlying output stream. The mechanism for doing this is dependent on whether you are using compression or not but in both scenarios, the key/values are appended to an internal buffer and only flushed to disk after the buffer reaches a certain size.

      Thus it is quite possible for Flume to lose messages if the agent crashes, or is stopped, before the internal buffer is flushed to disk.

      The correct action is to force the writer to sync its internal buffers to the underlying `FSDataOutputStream` first before calling hflush/sync.

      Additionally, I believe we should be calling hsync instead of hflush. Its my understanding writes with hsync should be more durable which I believe are the semantics we want here.

      1. FLUME-2922.patch
        5 kB
        Kevin Conaway

        Issue Links

          Activity

          Hide
          hshreedharan Hari Shreedharan added a comment -

          I agree, we should flush the hdfs sequence file writer.

          As for the hflush vs hsync: I don't know of any real-world HDFS-based system that uses hsync, because the hsync implementation is known to be not-optimized and hits performance pretty drastically. hflush is what most systems use - as it flushes to datanode memory. There is data loss only if all 3 datanodes go down before namenode detects under-replication and replicates the block - which is really unlikely. I believe the performance cost may not be worth it.

          Show
          hshreedharan Hari Shreedharan added a comment - I agree, we should flush the hdfs sequence file writer. As for the hflush vs hsync: I don't know of any real-world HDFS-based system that uses hsync, because the hsync implementation is known to be not-optimized and hits performance pretty drastically. hflush is what most systems use - as it flushes to datanode memory. There is data loss only if all 3 datanodes go down before namenode detects under-replication and replicates the block - which is really unlikely. I believe the performance cost may not be worth it.
          Hide
          kevinconaway Kevin Conaway added a comment -

          As for the hflush vs hsync: I don't know of any real-world HDFS-based system that uses hsync, because the hsync implementation is known to be not-optimized and hits performance pretty drastically. hflush is what most systems use - as it flushes to datanode memory. There is data loss only if all 3 datanodes go down before namenode detects under-replication and replicates the block - which is really unlikely. I believe the performance cost may not be worth it.

          OK, thanks. My understanding of this is still limited. I had referenced the HDFS storm integration as well as the Kafka HDFS Connector which both use hsync but I don't know how "real world" those integration are (yet)

          Show
          kevinconaway Kevin Conaway added a comment - As for the hflush vs hsync: I don't know of any real-world HDFS-based system that uses hsync, because the hsync implementation is known to be not-optimized and hits performance pretty drastically. hflush is what most systems use - as it flushes to datanode memory. There is data loss only if all 3 datanodes go down before namenode detects under-replication and replicates the block - which is really unlikely. I believe the performance cost may not be worth it. OK, thanks. My understanding of this is still limited. I had referenced the HDFS storm integration as well as the Kafka HDFS Connector which both use hsync but I don't know how "real world" those integration are (yet)
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user kevinconaway opened a pull request:

          https://github.com/apache/flume/pull/52

          FLUME-2922 Sync SequenceFile.Writer before calling hflush

          @harishreedharan will you please review?

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/kevinconaway/flume flume-2922

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flume/pull/52.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #52


          commit f03a2406bf44a8300522c1941293e2d74df88d28
          Author: Kevin Conaway <kevin.conaway@walmart.com>
          Date: 2016-06-09T19:50:13Z

          FLUME-2922 Sync SequenceFile.Writer before calling hflush


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user kevinconaway opened a pull request: https://github.com/apache/flume/pull/52 FLUME-2922 Sync SequenceFile.Writer before calling hflush @harishreedharan will you please review? You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinconaway/flume flume-2922 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flume/pull/52.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #52 commit f03a2406bf44a8300522c1941293e2d74df88d28 Author: Kevin Conaway <kevin.conaway@walmart.com> Date: 2016-06-09T19:50:13Z FLUME-2922 Sync SequenceFile.Writer before calling hflush
          Show
          kevinconaway Kevin Conaway added a comment - Attaching patch via https://patch-diff.githubusercontent.com/raw/apache/flume/pull/52.patch
          Hide
          kevinconaway Kevin Conaway added a comment -

          Hari Shreedharan did you get a chance to review this one?

          Show
          kevinconaway Kevin Conaway added a comment - Hari Shreedharan did you get a chance to review this one?
          Hide
          kevinconaway Kevin Conaway added a comment -

          Hari Shreedharan or Jarek Jarcec Cecho, is someone able to review this?

          Show
          kevinconaway Kevin Conaway added a comment - Hari Shreedharan or Jarek Jarcec Cecho , is someone able to review this?
          Hide
          mpercy Mike Percy added a comment -

          Kevin Conaway, this looks like a good change. I wouldn't mind committing this. However, I left some comments for you on the pull request regarding the test.

          (Regarding pull requests: I know we were told a long time ago not to use pull requests, but I've seen other ASF projects do it recently (such as Spark) and they are pretty convenient for code reviews... The biggest downside IMHO is that I don't want to apply merge commits, because they make the commit log hard to read, so I have to manually squash everything on commit. But I don't mind doing that.)

          Show
          mpercy Mike Percy added a comment - Kevin Conaway , this looks like a good change. I wouldn't mind committing this. However, I left some comments for you on the pull request regarding the test. (Regarding pull requests: I know we were told a long time ago not to use pull requests, but I've seen other ASF projects do it recently (such as Spark) and they are pretty convenient for code reviews... The biggest downside IMHO is that I don't want to apply merge commits, because they make the commit log hard to read, so I have to manually squash everything on commit. But I don't mind doing that.)
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Actually, since you can't merge via github (ASF does not give us write access via github) - we basically need to pull the branch into our local master and then commit it, but I think it will still bring in multiple commits. Spark does have a script that squashes the commits and then pushes it (and does some jira linking too).

          Show
          hshreedharan Hari Shreedharan added a comment - Actually, since you can't merge via github (ASF does not give us write access via github) - we basically need to pull the branch into our local master and then commit it, but I think it will still bring in multiple commits. Spark does have a script that squashes the commits and then pushes it (and does some jira linking too).
          Hide
          kevinconaway Kevin Conaway added a comment -

          I'm happy to do whatever works for you for merging. If you want to use the PR for code review, I can then attach a patch here when everyone is happy.

          Show
          kevinconaway Kevin Conaway added a comment - I'm happy to do whatever works for you for merging. If you want to use the PR for code review, I can then attach a patch here when everyone is happy.
          Hide
          mpercy Mike Percy added a comment -

          Hari Shreedharan, yeah you are right, everything is pretty manual in my process at this point. Download the PR patch, git am foo.patch, git rebase to squash and edit the commit message, etc. Do you have a link to the script Spark uses?

          Kevin Conaway, sorry, didn't mean to confuse you, the whole PR process is probably a discussion we should start in its own thread on the dev list Anyway, feel free to use the PR for this issue, for now, since I don't mind using it as a reviewer this time.

          Show
          mpercy Mike Percy added a comment - Hari Shreedharan , yeah you are right, everything is pretty manual in my process at this point. Download the PR patch, git am foo.patch, git rebase to squash and edit the commit message, etc. Do you have a link to the script Spark uses? Kevin Conaway , sorry, didn't mean to confuse you, the whole PR process is probably a discussion we should start in its own thread on the dev list Anyway, feel free to use the PR for this issue, for now, since I don't mind using it as a reviewer this time.
          Hide
          mpercy Mike Percy added a comment -

          Although I'm not trying to override Hari Shreedharan so Hari LMK if I'm stepping on your toes, just trying to help out w/ reviews and push the cart

          Show
          mpercy Mike Percy added a comment - Although I'm not trying to override Hari Shreedharan so Hari LMK if I'm stepping on your toes, just trying to help out w/ reviews and push the cart
          Hide
          hshreedharan Hari Shreedharan added a comment -

          No issues at all Mike Percy. I have not had the time recently to do reviews. This is the script that Spark uses to merge and this is the one to link the PR to jira.

          Show
          hshreedharan Hari Shreedharan added a comment - No issues at all Mike Percy . I have not had the time recently to do reviews. This is the script that Spark uses to merge and this is the one to link the PR to jira.
          Hide
          mpercy Mike Percy added a comment -

          +1. I am about to merge this pull request (manually, not using the above scripts, as it needs a couple tweaks before commit)

          Show
          mpercy Mike Percy added a comment - +1. I am about to merge this pull request (manually, not using the above scripts, as it needs a couple tweaks before commit)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 358bb670029549ed4cff192c79307fd3e4d69972 in flume's branch refs/heads/trunk from Kevin Conaway
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=358bb67 ]

          FLUME-2922. Sync SequenceFile.Writer before calling hflush

          This closes #52

          (Kevin Conaway via Mike Percy)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 358bb670029549ed4cff192c79307fd3e4d69972 in flume's branch refs/heads/trunk from Kevin Conaway [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=358bb67 ] FLUME-2922 . Sync SequenceFile.Writer before calling hflush This closes #52 (Kevin Conaway via Mike Percy)
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flume/pull/52

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flume/pull/52
          Hide
          mpercy Mike Percy added a comment -

          Pushed to trunk. Thank you for the patch, Kevin!

          Show
          mpercy Mike Percy added a comment - Pushed to trunk. Thank you for the patch, Kevin!

            People

            • Assignee:
              kevinconaway Kevin Conaway
              Reporter:
              kevinconaway Kevin Conaway
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development