  1. Apache Storm
  2. STORM-837

HdfsState ignores commits


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: storm-hdfs
    • Labels: None

    Description

      HdfsState works with Trident, which is supposed to provide exactly-once processing. Trident does this in two ways: first, by informing the state about commits so it can be sure the data is written out, and second, by passing a commit ID so that double commits can be handled.
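
      A minimal sketch of that contract, for illustration only. CommitAwareState is a hypothetical stand-in for the Trident state interface (only the beginCommit and commit calls named in this issue are modeled), and BufferingState, update, and the in-memory "durable" list are assumptions made for the example, not storm-hdfs code. It shows a state that makes data durable only on commit and uses the commit ID to skip a replayed batch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Trident state contract: two calls, each carrying the commit ID.
interface CommitAwareState {
    void beginCommit(Long txid);
    void commit(Long txid);
}

// Illustrative state that buffers a batch and only makes it durable on commit.
class BufferingState implements CommitAwareState {
    private final List<String> pending = new ArrayList<>();
    // Stands in for data flushed to HDFS. A real state would also have to persist
    // lastCommittedTxid durably, otherwise it is lost on a worker crash.
    private final List<String> durable = new ArrayList<>();
    private Long lastCommittedTxid;
    private boolean replay;

    @Override
    public void beginCommit(Long txid) {
        // Seeing the last committed ID again means this batch is a replay.
        replay = txid != null && txid.equals(lastCommittedTxid);
        pending.clear();
    }

    // Called between beginCommit and commit with the tuples of the batch.
    void update(List<String> tuples) {
        if (!replay) {
            pending.addAll(tuples);
        }
    }

    @Override
    public void commit(Long txid) {
        if (!replay) {
            durable.addAll(pending); // the "flush" is tied to the commit
            lastCommittedTxid = txid;
        }
        pending.clear();
    }

    List<String> durable() {
        return durable;
    }
}
```

      With this shape, a batch that is replayed under the same commit ID after a successful commit leaves the durable data unchanged, and a batch that never reaches commit is never made durable.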

      HdfsState ignores the beginCommit and commit calls, and with them the commit IDs. This means that if you use HdfsState and your worker crashes, you may both lose data and get some data twice.

      At a minimum, the flush and file rotation should be tied to the commit in some way. The commit ID should also be written out with the data so that someone reading the data has a hope of deduplicating it themselves.
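
      One way a reader could deduplicate, sketched under loud assumptions: the BEGIN/ROW/COMMIT tab-separated layout and the CommitTaggedLog class are hypothetical and are not an existing storm-hdfs format. The sketch assumes the writer emits a BEGIN line in beginCommit, tags every row with the commit ID, emits a COMMIT line in commit, and that the file is written sequentially with well-formed lines:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical line format in which every line carries the commit ID.
class CommitTaggedLog {
    static String begin(long txid) { return "BEGIN\t" + txid; }
    static String row(long txid, String payload) { return "ROW\t" + txid + "\t" + payload; }
    static String commit(long txid) { return "COMMIT\t" + txid; }

    // Returns only payloads from attempts that reached COMMIT, once per commit ID.
    static List<String> readCommitted(List<String> lines) {
        List<String> out = new ArrayList<>();
        Set<Long> committed = new HashSet<>();
        List<String> attempt = new ArrayList<>();
        Long open = null; // commit ID of the attempt currently being read
        for (String line : lines) {
            String[] parts = line.split("\t", 3);
            Long txid = Long.valueOf(parts[1]);
            switch (parts[0]) {
                case "BEGIN":
                    // A new attempt discards rows left over from a crashed one.
                    attempt.clear();
                    open = txid;
                    break;
                case "ROW":
                    if (txid.equals(open)) {
                        attempt.add(parts[2]);
                    }
                    break;
                case "COMMIT":
                    // Set.add returns false for an ID that was already committed,
                    // so a double commit contributes nothing.
                    if (txid.equals(open) && committed.add(txid)) {
                        out.addAll(attempt);
                    }
                    attempt.clear();
                    open = null;
                    break;
                default:
                    break;
            }
        }
        return out;
    }
}
```

      Rows from an attempt that never reached COMMIT are dropped, which covers the partial-write case, and a second COMMIT for an ID already seen is ignored, which covers the double-commit case.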

      Also, with the rotationActions it is possible for a partially written file to be leaked and never moved to its final location, because it is never rotated. I personally think the actions are too generic for this case and need to be deprecated.


          People

            Assignee: arunmahadevan Arun Mahadevan
            Reporter: revans2 Robert Joseph Evans
            Votes: 0
            Watchers: 11
