HDFS-13294: Flushing writes to disk with libhdfs


Details

    • Type: Wish
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: libhdfs
    • Labels: None

    Description

      I'm working with an FTP server that writes into HDFS using libhdfs. I'd like to ensure that incoming files are persisted on datanode disks before returning success to clients. At present, power failures often mean lost blocks for recent uploads.

      The hsync() call and the CreateFlag.SYNC_BLOCK open flag seem like the right direction, but there doesn't appear to be a way to set SYNC_BLOCK through the libhdfs interface, and I believe hsync() only applies to the current block of a file handle.
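
      For reference, the existing libhdfs open call only accepts POSIX-style fcntl flags, so there is currently no parameter that could carry an HDFS CreateFlag value such as SYNC_BLOCK (declaration as it appears in hdfs.h; check your release for the exact types):

          /* libhdfs open: 'flags' is interpreted as O_RDONLY, O_WRONLY
           * (create/overwrite), or O_WRONLY|O_APPEND; nothing here maps to
           * CreateFlag.SYNC_BLOCK on the Java side. */
          hdfsFile hdfsOpenFile(hdfsFS fs, const char *path, int flags,
                                int bufferSize, short replication,
                                tSize blocksize);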

      Thoughts on implementing it:

      1. Use an existing 'close enough' fcntl flag to set SYNC_BLOCK?
           Maybe O_DIRECT? Or O_SYNC or O_DSYNC.
           This is probably the best option, since it keeps the libhdfs interface unchanged and older versions would simply ignore the flag (see the sketch after this list).
      2. Add an hdfsOpenFile2() that accepts HDFS CreateFlag values (instead of fcntl flags)?
      3. Provide a method in DFSOutputStream to set shouldSyncBlock on an existing stream, and a function in libhdfs to enable it?
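
      A minimal sketch of how option 1 might look from the caller's side, assuming libhdfs were taught to translate O_SYNC into CreateFlag.SYNC_BLOCK internally (hypothetical behavior; today the extra flag is simply ignored):

          #include <fcntl.h>
          #include "hdfs.h"

          /* Hypothetical: if libhdfs mapped O_SYNC to CreateFlag.SYNC_BLOCK,
           * callers could request per-block syncing at open time without any
           * change to the hdfsOpenFile() signature. */
          hdfsFile open_synced(hdfsFS fs, const char *path)
          {
              return hdfsOpenFile(fs, path, O_WRONLY | O_SYNC,
                                  0 /* default buffer size */,
                                  0 /* default replication */,
                                  0 /* default block size */);
          }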

      To flush writes with libhdfs right now (on CDH5), I'm guessing my only option is to call hsync() after every 'block size' worth of writes, exactly on the block boundary (see the sketch below).
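
      A minimal sketch of that workaround, assuming a libhdfs build that exposes hdfsHSync() in hdfs.h (the connection parameters, path, and buffer handling are illustrative only):

          #include <stdio.h>
          #include <string.h>
          #include <fcntl.h>
          #include "hdfs.h"

          /* Write 'len' bytes, calling hdfsHSync() each time the running byte
           * count crosses a block boundary, so every completed block is synced
           * to datanode disks before success is reported to the FTP client. */
          static int write_with_block_sync(hdfsFS fs, hdfsFile file,
                                           const char *buf, tSize len,
                                           tOffset block_size, tOffset *written)
          {
              tSize off = 0;
              while (off < len) {
                  /* Bytes remaining in the current block. */
                  tOffset remaining = block_size - (*written % block_size);
                  tSize chunk = (tSize)(remaining < (tOffset)(len - off)
                                        ? remaining : (tOffset)(len - off));
                  tSize n = hdfsWrite(fs, file, buf + off, chunk);
                  if (n == -1)
                      return -1;
                  off += n;
                  *written += n;
                  /* Exactly on the boundary: force the finished block to disk. */
                  if (*written % block_size == 0 && hdfsHSync(fs, file) == -1)
                      return -1;
              }
              return 0;
          }

          int main(void)
          {
              hdfsFS fs = hdfsConnect("default", 0);
              hdfsFile out = hdfsOpenFile(fs, "/tmp/upload.dat", O_WRONLY, 0, 0, 0);
              if (!fs || !out)
                  return 1;

              tOffset block_size = hdfsGetDefaultBlockSize(fs);
              tOffset written = 0;
              const char *data = "example payload";

              if (write_with_block_sync(fs, out, data, (tSize)strlen(data),
                                        block_size, &written) == -1)
                  fprintf(stderr, "write/sync failed\n");

              hdfsHSync(fs, out);      /* sync the final partial block as well */
              hdfsCloseFile(fs, out);
              hdfsDisconnect(fs);
              return 0;
          }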

      Best regards,
      John

          People

            Assignee: Unassigned
            Reporter: John Thiltges (jthiltges)