Details

    • New Feature
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • None
    • None
    • fs
    • None

    Description

      It would be good to add a merge(Path dir1, Path dir2, ... ) API to HDFS. Its semantics would be to move all files under dir1 to dir2, renaming files in case of collisions.
      In the absence of this API, Hive[1] has to check for a collision for each file, come up with a unique name, try again, and so on. This is inefficient in multiple ways:

      1) It generates a huge number of calls on the NameNode (NN): at least 2 * the number of source files in dir1.
      2) It suffers from a TOCTOU[2] bug for the client-picked name in case of a collision.
      3) The whole operation is not atomic.
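
      The client-side pattern criticized above can be sketched as follows. This is only an illustration on the local filesystem using java.nio.file, not Hive's code and not the proposed HDFS API. The class name, the "_copy_N" naming scheme, and the calls counter (a stand-in for NameNode round trips) are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogue of a client-side "merge dir1 into dir2".
class ClientSideMergeSketch {

    // Counts check and rename operations; stands in for NameNode calls.
    static int calls = 0;

    // Moves every regular file in dir1 into dir2, picking a new name on collision.
    // Returns the number of files moved.
    static int merge(Path dir1, Path dir2) throws IOException {
        List<Path> sources;
        try (Stream<Path> listing = Files.list(dir1)) {
            sources = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        int moved = 0;
        for (Path src : sources) {
            String name = src.getFileName().toString();
            Path dst = dir2.resolve(name);
            int attempt = 0;
            while (true) {
                calls++; // call 1: does the destination name already exist?
                if (Files.exists(dst)) {
                    // Collision: pick another name (assumed scheme) and check again.
                    dst = dir2.resolve(name + "_copy_" + (++attempt));
                    continue;
                }
                calls++; // call 2: the rename itself
                try {
                    // Without REPLACE_EXISTING this fails if dst appeared after the check.
                    Files.move(src, dst);
                    moved++;
                    break;
                } catch (FileAlreadyExistsException raced) {
                    // TOCTOU window: another writer created dst between check and move.
                    dst = dir2.resolve(name + "_copy_" + (++attempt));
                }
            }
        }
        return moved; // not atomic: a failure midway leaves dir1 partially moved
    }

    public static void main(String[] args) throws IOException {
        Path dir1 = Files.createTempDirectory("merge-src");
        Path dir2 = Files.createTempDirectory("merge-dst");
        Files.write(dir1.resolve("part-0"), "new".getBytes());
        Files.write(dir2.resolve("part-0"), "old".getBytes());
        int moved = merge(dir1, dir2);
        System.out.println("moved=" + moved + ", calls=" + calls);
    }
}
```

      Even without any collision, each file costs one existence check plus one rename, which matches the lower bound of two calls per source file noted in point 1. A server-side merge could, in principle, resolve collisions in a single operation.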

      A merge API as outlined above would be immensely useful for Hive and potentially for other HDFS users.

      [1] https://github.com/apache/hive/blob/release-2.0.0-rc1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2576
      [2] https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use

      Attachments

        1. HDFS_Merge_API_Proposal.pdf
          242 kB
          Xiaobing Zhou

          People

            xiaobingo Xiaobing Zhou
            ashutoshc Ashutosh Chauhan
            Votes: 0
            Watchers: 18