Apache Storm / STORM-2219

In HDFSBolt and SequenceFileBolt the files are overwritten if they already exist



    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: storm-hdfs
    • Labels: None


      Both bolts open their output files in create mode, so a file that already exists is overwritten. If the bolt is restarted for any reason (a rebalance or a crash), the data already written is lost, which I think is especially serious. What's more, because the rotation number is kept only in memory, it starts over after a restart, so all the existing files will eventually be overwritten.
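
      To make the failure mode concrete, here is a minimal sketch that reproduces it on the local filesystem with java.nio instead of HDFS. It is not the storm-hdfs code: CreateModeWriter, the data-<rotation>.txt naming and the run helper are hypothetical, and CREATE plus TRUNCATE_EXISTING only stands in for opening a file in create mode.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class OverwriteOnRestartDemo {

    /** Hypothetical stand-in for a bolt's writer: the rotation counter lives only in memory. */
    static class CreateModeWriter {
        private final Path dir;
        private int rotation = 0; // starts at 0 again for every new instance

        CreateModeWriter(Path dir) {
            this.dir = dir;
        }

        void write(String record) throws IOException {
            Path file = dir.resolve("data-" + rotation + ".txt");
            // Stand-in for "create mode": any existing content is discarded.
            Files.write(file, record.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE,
                    StandardOpenOption.TRUNCATE_EXISTING);
        }

        void rotate() {
            rotation++;
        }
    }

    /** Writes before and after a simulated restart and returns the content of the first file. */
    static String run(Path dir) throws IOException {
        CreateModeWriter first = new CreateModeWriter(dir);
        first.write("before-restart");
        first.rotate();
        first.write("second-file");

        // Simulated restart: a fresh instance whose counter is back at 0.
        CreateModeWriter restarted = new CreateModeWriter(dir);
        restarted.write("after-restart");

        return new String(Files.readAllBytes(dir.resolve("data-0.txt")), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("overwrite-demo");
        System.out.println(run(dir)); // prints "after-restart"
    }
}
```

      The record written before the restart is gone from data-0.txt, and data-1.txt would be replaced in the same way at the next rotation.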

      I think there are two possible approaches:

      • If the file already exists, open it in append mode. I see some problems here: (1) the tuple data written across the rotations will not keep its order unless we jump to the last rotation, (2) the TimedRotationPolicy and other policies that rely on data stored in memory will not behave exactly as expected, and (3) in the case of the SequenceFileBolt, if the existing file has a different compression codec or type, an exception will be raised. Besides, we would have to change the way the HDFSWriter tracks the write offset, because it depends on the size of the tuples being written and not on the size of the file (which would affect the FileSizeRotationPolicy). This doesn't affect the SequenceFileWriter, since it uses the getLength() method of SequenceFile.Writer, which handles append mode properly.
      • If the file exists, move to the next rotation. The problem I see is that if the rotation number is not part of the file name, this will enter an endless loop. Another issue is that if the restart of the bolt is caused by some problem that is not fixed by the restart, it could keep creating new files indefinitely until the NameNode collapses.
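
      The second approach, with guards for the two problems it raises, could look roughly like the sketch below. It again uses java.nio on the local filesystem rather than HDFS, and every name in it (nextFreeRotation, the {rotation} placeholder, maxAttempts) is hypothetical rather than part of storm-hdfs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class NextFreeRotation {

    /**
     * Returns the first rotation number >= start whose file does not exist yet.
     *
     * The name pattern must contain "{rotation}": without it every candidate maps
     * to the same path, which is the endless-loop case. The search is also capped
     * at maxAttempts candidates, so it fails instead of skipping forever.
     */
    static int nextFreeRotation(Path dir, String namePattern, int start, int maxAttempts)
            throws IOException {
        if (!namePattern.contains("{rotation}")) {
            throw new IllegalArgumentException(
                    "file name pattern must include {rotation}: " + namePattern);
        }
        for (int i = 0; i < maxAttempts; i++) {
            int rotation = start + i;
            Path candidate = dir.resolve(
                    namePattern.replace("{rotation}", Integer.toString(rotation)));
            if (!Files.exists(candidate)) {
                return rotation;
            }
        }
        throw new IOException("no free rotation found after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rotation-demo");
        Files.createFile(dir.resolve("data-0.txt"));
        Files.createFile(dir.resolve("data-1.txt"));
        System.out.println(nextFreeRotation(dir, "data-{rotation}.txt", 0, 1000)); // prints 2
    }
}
```

      If the search always starts at 0 after a restart, the cap also bounds the total number of files a restart loop can create. Whether failing the bolt is the right reaction at that point is an open design question.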

      I guess the solution will be a mix of both approaches, and I think I can implement it. But first I would like to ask whether anyone has any other concerns about it.

      For the moment I have just written a bolt that satisfies my use case: it opens Sequence Files in append mode if the file exists and rotates based on size. But the solution should be more general.
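
      The general idea of append-if-exists with size-based rotation can be sketched as follows. This is not the bolt mentioned above: it writes plain text files with java.nio instead of Sequence Files on HDFS, and the class name, file naming and maxBytes parameter are assumptions made for illustration. On startup it jumps to the last existing rotation and takes the write offset from the size of the file on disk, not from the tuples written since startup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendingSizeRotatingWriter {
    private final Path dir;
    private final long maxBytes;
    private int rotation = 0;
    private long offset;

    public AppendingSizeRotatingWriter(Path dir, long maxBytes) throws IOException {
        this.dir = dir;
        this.maxBytes = maxBytes;
        // Jump to the last existing rotation so that records stay in order.
        while (Files.exists(fileFor(rotation + 1))) {
            rotation++;
        }
        // Take the offset from the file on disk, not from an in-memory count.
        offset = sizeOrZero(fileFor(rotation));
        if (offset >= maxBytes) {
            // The last file is already full: continue in a fresh one.
            rotation++;
            offset = 0;
        }
    }

    public void write(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        // APPEND keeps existing content; CREATE makes the file if it is missing.
        Files.write(fileFor(rotation), bytes,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
        offset += bytes.length;
        if (offset >= maxBytes) {
            rotation++;
            offset = sizeOrZero(fileFor(rotation));
        }
    }

    public int getRotation() {
        return rotation;
    }

    public long getOffset() {
        return offset;
    }

    private Path fileFor(int n) {
        return dir.resolve("data-" + n + ".txt");
    }

    private static long sizeOrZero(Path file) throws IOException {
        return Files.exists(file) ? Files.size(file) : 0L;
    }
}
```

      Creating a second instance over the same directory simulates a restart: it resumes in the partially filled file and rotates at the same size threshold. Time-based policies and the codec mismatch for Sequence Files are not covered by this sketch.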




            Assignee: Unassigned
            Reporter: Yoel Cabo Lopez (yoelcabo)