Hadoop HDFS / HDFS-4256

Backport concatenation of files into a single file to branch-1

    Details

    • Type: Test
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: namenode
    • Labels: None

      Description

      HDFS-222 added support for concatenating multiple files in a directory into a single file. This helps several use cases where writes can be parallelized, and several folks have expressed interest in this functionality.

      This jira intends to backport changes equivalent to HDFS-222 into branch-1, to be made available in release 1.2.0.
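
For illustration only, here is a minimal Python sketch of what concatenation amounts to at the namespace level, assuming it is a metadata operation that moves block lists rather than copying data. `ToyFile`, `concat`, and the dictionary used as a namespace are hypothetical names, not Hadoop APIs. The single check on block size reflects only the restriction raised in the discussion of HDFS-222 and is not the full set of checks in the real patch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFile:
    """Hypothetical stand-in for a file: a block size plus per-block lengths."""
    block_size: int
    blocks: list = field(default_factory=list)  # length in bytes of each block

    @property
    def length(self):
        return sum(self.blocks)


def concat(namespace, target, sources):
    """Append each source's blocks to target, in the order given, then drop the sources."""
    if not sources:
        raise ValueError("no source files given")
    if target in sources or len(set(sources)) != len(sources):
        raise ValueError("target and sources must all be distinct")
    tgt = namespace[target]
    # Validate everything first so a failure leaves the namespace untouched.
    for name in sources:
        if namespace[name].block_size != tgt.block_size:
            raise ValueError(f"{name} has a different block size than {target}")
    # Only metadata moves: block lists are spliced, no bytes are copied.
    for name in sources:
        tgt.blocks.extend(namespace.pop(name).blocks)
    return tgt
```

In this toy model, concatenating two sources with blocks `[64]` and `[64, 10]` onto a target with blocks `[64, 64]` leaves a single file with blocks `[64, 64, 64, 64, 10]`, and the two source names disappear from the namespace.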

      Attachments

      1. HDFS-4256.patch (37 kB, Sanjay Radia)
      2. HDFS-4256-2.patch (38 kB, Sanjay Radia)

          Activity

          Harsh J added a comment -

          HDFS-222 has a requirement that all files' block lengths be the same (not very practical for many use cases, even if it helps that you can write data and pad it in parallel) and also required a protocol version bump.

          If you don't mind me asking, do we really need this back-ported into branch-1 too, when branch-2 is nearing a stable release?

          Suresh Srinivas added a comment -

          > HDFS-222 has a requirement that all files' block lengths be the same (not very practical for many use cases, even if it helps that you can write data and pad it in parallel) and also required a protocol version bump.

          I understand HDFS-222 and its limitations. I had worked closely on that. I also think the protocol version bump can be done if we are adding it in another release like 1.2.0, right? Is there any issue with that?

          > If you don't mind me asking, do we really need this back-ported into branch-1 too, when branch-2 is nearing a stable release?

          Even though branch-2 is nearing a stable release, there are many folks who want to remain on the releases that they are running. You see that in the questions asked on the mailing list, where people are still on the 0.20 release. I feel that if there are features that are of interest, we should be able to add them back to earlier releases.

          Harsh J added a comment -

          Suresh,

          Thanks for commenting back!

          > I also think the protocol version bump can be done if we are adding it in another release like 1.2.0, right? Is there any issue with that?

          > You see that in the questions asked on the mailing list, where people are still on the 0.20 release.

          I am not strongly against this, but I do feel users wouldn't move to a higher release unless we urged them (we release it, after all), and we haven't been doing that in any good form.

          This just feels like more backward work to me (moving features backwards, instead of just stability fixes and optimizations). But I am not intending to block it in any way; I just wished to voice this because users of HDFS stand to benefit more from a full move ahead. Please go ahead.

          Suresh Srinivas added a comment -

          This work was done three years ago!

          > This just feels like more backward work to me

          I actually do not think so. No one is proposing porting all the features, or large features like Federation. That would be a lot of work. On the contrary, not porting small improvements to stable releases has resulted in people putting together their own Frankenstein releases or moving from Apache releases to other distributions that backport changes.

          > But I am not intending to block it in any way...

          I have seen this type of discussion in a couple of the jiras. In the end, what goes in is the Release Manager's call. As long as we are not backporting changes that result in huge stability issues, or if the risk is mitigated with adequate testing, I do not see why backporting should be discouraged.

          Harsh J added a comment -

          > This work was done three years ago!

          I didn't call this out (the stability aspect?). I'm unsure why you're exclaiming it, that's all.

          > On the contrary, not porting small improvements to stable releases…

          We have a release out there, which we aren't yet encouraging adoption of, that has all this in it already.

          In any case, are we also intending to do the DistCp optimizations that can benefit from this, i.e. MAPREDUCE-2117 and MAPREDUCE-2557, so that there's also an internal use of this, out of the box?

          Suresh Srinivas added a comment - edited

          > I didn't call this out (the stability aspect?). I'm unsure why you're exclaiming it, that's all.

          The exclamation was to indicate that this is a very old change; it was not about the stability aspect.

          > We have a release out there, which we aren't yet encouraging adoption of, that has all this in it already.

          Not sure I understand. In case you are talking about 0.21, you know the status of that as well as 0.22.

          > are we also intending to do the DistCp optimizations that can benefit from this, i.e. MAPREDUCE-2117 and MAPREDUCE-2557, so that there's also an internal use of this, out of the box

          Right now I was not planning to. I encourage you to port it to 1.0 if you think the users would benefit from it.

          Sanjay Radia added a comment -

          Since this is an addition of a new RPC method in ClientProtocol (not a change in signature or deletion of an existing method), wire compatibility remains and hence the version number does not need to be bumped.
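
As a rough illustration of the argument above, the following Python sketch models an RPC layer that dispatches requests by method name. It is a toy under stated assumptions, not Hadoop's RPC engine: `ServerV1`, `ServerV2`, and `call` are hypothetical names, and the sketch says nothing about how Hadoop actually checks protocol versions. It only shows why adding a method leaves existing calls unaffected, while a newer client calling an older server would fail on the new method alone.

```python
class ServerV1:
    """Hypothetical server exposing the original method set."""

    def rename(self, src, dst):
        return f"renamed {src} -> {dst}"


class ServerV2(ServerV1):
    """Same protocol plus one added method; existing signatures are untouched."""

    def concat(self, target, sources):
        return f"concatenated {len(sources)} file(s) into {target}"


def call(server, method, *args):
    """Dispatch a request by method name, as a name-based RPC layer might."""
    handler = getattr(server, method, None)
    if not callable(handler):
        raise LookupError(f"unknown method: {method}")
    return handler(*args)
```

An old client that only ever sends `rename` gets the same answer from either server. Only a call to `concat` against `ServerV1` raises `LookupError`.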

          Sanjay Radia added a comment -

          Patch for branch-1.
          While I followed the HDFS-222 patch, I used the updated code from trunk, since it had been improved.

          Suresh Srinivas added a comment -

          Some comments:

          1. In javadoc, when referencing parameters such as srcs and target, use {@code srcs}.
          2. Move editlog logging to outside the synchronized section.
          3. Layout version -33 is already taken, so please use -41 and reserve that in LayoutVersion.java in trunk, using a separate jira.
          4. Not sure why the DFSTestUtil.java changes are needed?
          5. It is better to mention the order in which blocks are concatenated to the target.
          Suresh Srinivas added a comment -

          HDFS-4296 created for reserving layout version.

          Suresh Srinivas added a comment -

          I have committed HDFS-4296 and have reserved -41 for the layout version to be used in this patch.

          Sanjay Radia added a comment -

          Updated patch.

          • The editlog logging within the synchronized block does not sync the log; this code is similar to the other methods in that class.
          • The DFSTestUtil change was made to ensure that the test files use the right block size. DFSTestUtil had been fixed in trunk to do that, and hence I backported it.
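
To illustrate the distinction drawn in the first bullet, here is a minimal Python sketch of recording an edit while holding a lock and syncing it separately. It assumes, beyond what the comment states, that the sync is performed after the lock is released. `ToyEditLog`, `ToyNamesystem`, and their methods are hypothetical and do not mirror the actual branch-1 classes.

```python
import threading


class ToyEditLog:
    """Hypothetical edit log: buffering is cheap, syncing is the slow step."""

    def __init__(self):
        self._lock = threading.Lock()
        self.buffered = []  # edits recorded but not yet durable
        self.synced = []    # edits flushed to stable storage (simulated)

    def log_edit(self, op):
        with self._lock:
            self.buffered.append(op)

    def log_sync(self):
        with self._lock:
            self.synced.extend(self.buffered)
            self.buffered.clear()


class ToyNamesystem:
    """Hypothetical namesystem mapping file names to lengths."""

    def __init__(self):
        self._lock = threading.Lock()
        self.edit_log = ToyEditLog()
        self.files = {}

    def concat(self, target, sources):
        with self._lock:
            # The namespace change and its edit record happen together under the lock.
            for name in sources:
                self.files[target] += self.files.pop(name)
            self.edit_log.log_edit(("concat", target, tuple(sources)))
        # The slow sync runs only after the lock has been released (an assumption here).
        self.edit_log.log_sync()
```

The point of the split is that other operations waiting on the namesystem lock are not held up by the sync, while the edit is still recorded in the same critical section as the change it describes.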
          Suresh Srinivas added a comment -

          +1 for the patch.

          Matt Foley added a comment -

          Closed upon release of Hadoop 1.2.0.


            People

            • Assignee: Sanjay Radia
            • Reporter: Suresh Srinivas
            • Votes: 0
            • Watchers: 7
