Hadoop Common
HADOOP-3113

DFSOutputStream.flush() should flush data to the real block file on the DataNode.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Added sync() method to FSDataOutputStream to really, really persist data in HDFS. InterDatanodeProtocol was changed to implement this feature.

      Description

      DFSOutputStream has a method called flush() that persists block locations on the namenode and sends all outstanding data to all datanodes in the pipeline. However, this data goes to a tmp file on the datanode(s). When the block is closed, the tmp file is renamed to become the real block file. If the datanode(s) die before the block is complete, then the entire block is lost. This behaviour will be fixed in HADOOP-1700.
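The tmp-then-rename behaviour described above can be illustrated with a small stand-alone model. This is only a sketch using java.nio.file, not DataNode code: the class name TmpBlockSketch, its methods, and the ".tmp" suffix are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Toy model of a block that lives in a tmp file until it is closed (hypothetical, not HDFS code). */
class TmpBlockSketch {
    private final Path tmpFile;
    private final Path blockFile;

    TmpBlockSketch(Path dir, String blockName) {
        this.tmpFile = dir.resolve(blockName + ".tmp");
        this.blockFile = dir.resolve(blockName);
    }

    /** Convenience factory that places the block in a fresh temporary directory. */
    static TmpBlockSketch inTempDir(String blockName) {
        try {
            return new TmpBlockSketch(Files.createTempDirectory("datanode-sketch"), blockName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Data received from the pipeline is appended to the tmp file only. */
    void receive(String data) {
        try {
            Files.write(tmpFile, data.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Closing the block renames the tmp file to the real block file. */
    void close() {
        try {
            Files.move(tmpFile, blockFile, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean realBlockExists() {
        return Files.exists(blockFile);
    }

    public static void main(String[] args) {
        TmpBlockSketch block = TmpBlockSketch.inTempDir("blk_1");
        block.receive("flushed bytes");
        // A crash at this point would leave only the tmp file behind.
        System.out.println("real block before close: " + block.realBlockExists());
        block.close();
        System.out.println("real block after close: " + block.realBlockExists());
    }
}
```

In this model, anything received before close() exists only under the tmp name, which is the window in which the issue says data can be lost.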

      However, in the short term, a configuration parameter can be used to allow datanodes to write to the real block file directly, thereby avoiding the tmp file. This means that data that is flushed successfully by a client does not get lost even if the datanode(s) or the client die.
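As a rough illustration of writing to the real block file directly, the sketch below buffers writes and pushes them to the final file on flush(). It is a stand-alone model under stated assumptions: DirectBlockWriteSketch and its methods are hypothetical, the issue does not name the configuration parameter so none is modelled, and flush() here only hands bytes to the operating system (it does not force them to disk).

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Toy model of a writer that targets the final block file directly (hypothetical, not HDFS code). */
class DirectBlockWriteSketch {
    private final File blockFile;
    private final OutputStream out;

    DirectBlockWriteSketch(File blockFile) {
        this.blockFile = blockFile;
        try {
            // Buffered so that unflushed data stays in memory, like outstanding client data.
            this.out = new BufferedOutputStream(new FileOutputStream(blockFile));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience factory that creates the block file in the system temp directory. */
    static DirectBlockWriteSketch inTempFile() {
        try {
            return new DirectBlockWriteSketch(File.createTempFile("blk_", ".data"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void write(String data) {
        try {
            out.write(data.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Pushes buffered bytes into the real block file; no rename is needed afterwards. */
    void flush() {
        try {
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Number of bytes currently present in the real block file. */
    long visibleLength() {
        return blockFile.length();
    }

    public static void main(String[] args) {
        DirectBlockWriteSketch writer = DirectBlockWriteSketch.inTempFile();
        writer.write("hello");
        System.out.println("visible before flush: " + writer.visibleLength());
        writer.flush();
        System.out.println("visible after flush: " + writer.visibleLength());
    }
}
```

The point of the model is that flushed bytes already sit in the final file, so a writer that disappears after flush() leaves them in place without any close-time rename.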

      The namenode already has code to pick the largest replica (if multiple datanodes report different sizes for this block). The namenode also has code to avoid triggering a replication request if the file is still being written to.
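The "pick the largest replica" rule can be sketched as a maximum over reported replica lengths. This is a simplified illustration, not the namenode's actual code: LargestReplicaSketch, pickLargest, and the datanode names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of choosing the longest replica among datanodes (hypothetical, not NameNode code). */
class LargestReplicaSketch {

    /** Returns the datanode reporting the longest replica, or empty if none were reported. */
    static Optional<String> pickLargest(Map<String, Long> replicaLengths) {
        return replicaLengths.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        // Replica lengths in bytes, as each datanode might report them after a writer died.
        Map<String, Long> reported = new LinkedHashMap<>();
        reported.put("datanode-a", 4096L);
        reported.put("datanode-b", 8192L);
        reported.put("datanode-c", 6144L);
        System.out.println("chosen replica: " + pickLargest(reported).orElse("none"));
    }
}
```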

      The only caveat that I can think of is that the block report period should be much smaller than the lease timeout period. A block report adds the being-written-to blocks to the blocksMap, thereby avoiding any cleanup that lease expiry processing might otherwise have done.
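The timing caveat above amounts to a sanity check on two settings. The sketch below is one hedged way to express it: the class and method names, the safety factor, and the sample values are all assumptions for illustration, since the issue gives no concrete numbers or property names.

```java
/** Toy check that block reports arrive well before a lease can expire (hypothetical helper). */
class BlockReportTimingSketch {

    /**
     * Returns true if the block report interval is at least safetyFactor times
     * smaller than the lease timeout. The factor is an assumed stand-in for
     * "much smaller"; the issue does not quantify it.
     */
    static boolean reportsOutpaceLease(long blockReportIntervalMs, long leaseTimeoutMs, long safetyFactor) {
        if (blockReportIntervalMs <= 0 || leaseTimeoutMs <= 0 || safetyFactor < 1) {
            throw new IllegalArgumentException("intervals must be positive and safetyFactor >= 1");
        }
        return blockReportIntervalMs <= leaseTimeoutMs / safetyFactor;
    }

    public static void main(String[] args) {
        // Illustrative values only: a one-minute report interval against a one-hour lease timeout.
        System.out.println(reportsOutpaceLease(60_000L, 3_600_000L, 10L));
        // Equal intervals leave no margin for a report to arrive before lease expiry.
        System.out.println(reportsOutpaceLease(3_600_000L, 3_600_000L, 10L));
    }
}
```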

      Not all requirements specified by HADOOP-1700 are supported by this approach, but it could still be helpful (in the short term) for a wide range of applications.

      1. tmpFile.patch
        9 kB
        dhruba borthakur
      2. tmpFile.patch
        9 kB
        dhruba borthakur
      3. tmpFile.patch
        10 kB
        dhruba borthakur
      4. tmpFile.patch
        10 kB
        dhruba borthakur
      5. noTmpFile.patch
        2 kB
        dhruba borthakur
      6. noTmpFile.patch
        81 kB
        dhruba borthakur

          Activity

          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]
          Tsz Wo Nicholas Sze made changes -
          Link This issue relates to HADOOP-4961 [ HADOOP-4961 ]
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Robert Chansler made changes -
          Hadoop Flags [Reviewed, Incompatible change] [Incompatible change, Reviewed]
          Release Note An application can invoke sync on the FSDataOutputStream to really, really persist data in HDFS! This is an incompatible change because it required changes to InterDatanodeProtocol. Added sync() method to FSDataOutputStream to really, really persist data in HDFS. InterDatanodeProtocol to implement this feature.
          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Release Note An application can invoke sync on the FSDataOutputStream to really, really persist data in HDFS! This is an incompatible change because it required changes to InterDatanodeProtocol.
          Resolution Fixed [ 1 ]
          Hadoop Flags [Reviewed] [Incompatible change, Reviewed]
          Fix Version/s 0.18.0 [ 12312972 ]
          Tsz Wo Nicholas Sze made changes -
          Hadoop Flags [Reviewed]
          dhruba borthakur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          dhruba borthakur made changes -
          Attachment tmpFile.patch [ 12383353 ]
          dhruba borthakur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          dhruba borthakur made changes -
          Attachment tmpFile.patch [ 12383325 ]
          dhruba borthakur made changes -
          Attachment tmpFile.patch [ 12383316 ]
          dhruba borthakur made changes -
          Attachment tmpFile.patch [ 12383313 ]
          dhruba borthakur made changes -
          Attachment noTmpFile.patch [ 12382712 ]
          dhruba borthakur made changes -
          Link This issue is blocked by HADOOP-3310 [ HADOOP-3310 ]
          dhruba borthakur made changes -
          Summary Provide a configurable way for DFSOutputStream.flush() to flush data to real block file on DataNode. DFSOutputStream.flush() should flush data to real block file on DataNode.
          dhruba borthakur made changes -
          Link This issue blocks HADOOP-1700 [ HADOOP-1700 ]
          dhruba borthakur made changes -
          Attachment noTmpFile.patch [ 12378881 ]
          dhruba borthakur made changes -
          Field Original Value New Value
          Assignee dhruba borthakur [ dhruba ]
          dhruba borthakur created issue -

            People

            • Assignee:
              dhruba borthakur
              Reporter:
              dhruba borthakur
            • Votes:
              1
              Watchers:
              4