Hadoop HDFS / HDFS-641

Move all of the benchmarks and tests that depend on mapreduce to mapreduce

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.2
    • Fix Version/s: 0.21.0
    • Component/s: test
    • Labels:
      None

      Description

      Currently, we have a bad cycle: to build hdfs you need to test mapreduce, and iterate once. This is broken.

          Activity

          Vinod Kumar Vavilapalli added a comment -

          > While that is true, we need all 3 trunks (common, hdfs, and mapreduce) to depend on the current SNAPSHOT of each other. Otherwise, a change in common will break all of them. Using a fixed version of mapreduce for HDFS testing wouldn't work because that fixed version would become broken. In short, we would need to depend on a specific SNAPSHOT version of mapreduce. When would that be updated? Who would update it?

          Ah.. I didn't realize that. That definitely is a problem with letting HDFS use a particular version of mapred. Thanks, Owen, for making this clear!

          > The majority of the move is about the tools and benchmarks using HDFS and MapReduce that are better served being in MapReduce. Some of the tests should be recoded without MapReduce and pushed back into HDFS.

          I welcome this.

          In general, I share your concerns w.r.t. cyclic dependencies; I just felt this might not be the right fix. Even so, I don't pretend to have the right solution myself.

          One idea I get from looking at the cyclic dependencies is that the sore points are the MiniMR and MiniDFS clusters. Setting these aside, mapred and hdfs are more or less completely independent. So one solution could be to create MR and DFS cluster interfaces and have mapred/hdfs code against these interfaces? Thoughts?
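
          As a rough sketch of that idea (the interface and method names below are hypothetical, not existing Hadoop APIs), shared test code would depend only on a small cluster contract, and the hdfs and mapreduce projects would each supply their own implementation, e.g. thin wrappers around MiniDFSCluster and MiniMRCluster:

            // Hypothetical sketch only -- these interfaces do not exist in Hadoop today.
            package org.apache.hadoop.test;

            import java.io.IOException;
            import org.apache.hadoop.conf.Configuration;

            /** Minimal life-cycle contract that a test cluster would expose. */
            public interface MiniClusterService {
              void start(Configuration conf) throws IOException;
              void shutdown() throws IOException;
              /** Connection URI (e.g. the default FS or the JobTracker address) for tests to use. */
              String getConnectURI();
            }

            /** DFS flavour; the implementation (wrapping MiniDFSCluster) would live in the hdfs project. */
            interface DfsClusterService extends MiniClusterService {
              int getNumDataNodes();
            }

            /** MR flavour; the implementation (wrapping MiniMRCluster) would live in the mapreduce project. */
            interface MrClusterService extends MiniClusterService {
              int getNumTaskTrackers();
            }

          With something along these lines, the Mini* clusters would be the only pieces crossing project boundaries, and they would do so through a small, stable interface rather than through each other's build artifacts.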

          By the way, I am OK with not reverting this patch for now, as it at least addresses the cyclic dependency problem in one particular way. But we should definitely pursue other solutions, if any, perhaps on a different JIRA issue.

          Lee Tucker added a comment -

          Circular dependencies are inherently evil and fragile, and checked-in jar files doubly so. This cycle really needs to be broken for good build/test hygiene. While it is a step backwards for HDFS users to have to build and maintain an MR layer to test their fixes, it's really not much different from before the project split, except that now the dependency is clearly exposed.

          Perhaps there's a solution that doesn't involve using MR to test HDFS? That would resolve both the circular dependency and the delayed testing of HDFS (because you're waiting for MR to build).

          Doug Cutting added a comment -

          > Sharad/Hemant/Dhruba/Me all were either -1 or not +1 for the fix that got committed.

          No clear -1 votes were cast, but it's still not too late. If a committer feels this is wrong, it can be reverted and discussed further. We require one +1 review before commits, which I don't see above; and even then, committers should still be able to veto and revert a change for the first few days after it's been committed, in case they didn't have a chance to review it before it was committed.

          Owen O'Malley added a comment -

          Sigh. I apologize for not closing the loop on the complaints before committing this. It had been so long since I started the process that I lost track of the fact that you still weren't happy with the solution.

          Cycles in the dependency graph make everything harder. Until we get the interfaces locked down more and Common becomes more stable, there will be cross-project commits. While that is true, we need all 3 trunks (common, hdfs, and mapreduce) to depend on the current SNAPSHOT of each other. Otherwise, a change in common will break all of them. Using a fixed version of mapreduce for HDFS testing wouldn't work because that fixed version would become broken. In short, we would need to depend on a specific SNAPSHOT version of mapreduce. When would that be updated? Who would update it?

          The majority of the move is about the tools and benchmarks using HDFS and MapReduce that are better served being in MapReduce. Some of the tests should be recoded without MapReduce and pushed back into HDFS. However, most of the tests are the distributed equivalent of the tests still in HDFS and aren't that badly placed in MapReduce.

          Having a strict build order without cycles is a huge win, especially for automated build systems like Hudson running CI builds for us. With HADOOP-5107, we no longer need to check in jar files. That makes life far easier for developers. Automated CI builds are a big part of it.

          Vinod Kumar Vavilapalli added a comment -

          It is unfortunate that we couldn't make any progress on the complaints we have with this fix.

          It is more unfortunate that the fix got committed when clearly no consensus was ever reached. Sharad/Hemant/Dhruba/Me all were either -1 or not +1 for the fix that got committed.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #143 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/143/)
          HDFS-641. Missed a testcase in the move.
          HDFS-641. Move all of the components that depend on map/reduce to map/reduce. (omalley)

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #119 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/119/)
          HDFS-641. Missed a testcase in the move.

          Owen O'Malley added a comment -

          I just committed this.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #118 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/118/)
          HDFS-641. Move all of the components that depend on map/reduce to map/reduce. (omalley)

          Erik Steffl added a comment -

          Should the patch for HDFS-587 (0.21 version) also be part of this bug (committed to the HDFS-641 branch)? It deals with the tests that were moved from HDFS to MapReduce.

          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #20 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/20/)

          Konstantin Shvachko added a comment -

          Sorry, I meant to say:
          I am going to commit HDFS-663 (DFSIO for append) to branch 0.21 and branch HDFS-641 if there are no objections.

          Konstantin Shvachko added a comment -

          I am going to commit HDFS-641 to branch 0.21 and branch HDFS-641 if there are no objections...

          Konstantin Shvachko added a comment -

          The same question will apply to HDFS-236 if it needs to be committed before this is resolved.

          Konstantin Shvachko added a comment -

          There is an addition to the DFSIO benchmark related to append, see HDFS-663. I plan to commit it to branch 0.21. Do you want me to also commit it to branch HDFS-641? I just don't want these changes to get lost in the transition process, wherever it goes.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #98 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/98/)

          dhruba borthakur added a comment -

          Yes, using a checked-in specific version of mapreduce should work well. Maybe this is true for all hdfs-tools that use Stable interfaces of mapreduce.

          Sharad Agarwal added a comment -

          Since the tests in question use mapreduce as a tool, they can use a specific version of mapreduce, not the SNAPSHOT version.
          common (SNAPSHOT) -> hdfs (SNAPSHOT) -> mapreduce
          mapreduce (specific version) -> hdfs (only for hdfs_with_mr tests)
          The above dependencies don't lead to cycles. The version of mapreduce used by the hdfs tests doesn't need to be the latest. It can be upgraded to a newer version whenever required, in the same fashion as any other third-party lib.

          On a side note, I think the issue of cycles only arises if we make the assumption that whenever common or hdfs is built, their jars should be updated in mapreduce, which is not correct IMO. That defeats the purpose of the project split, no? Since these are different projects, they could have different release cycles in the future. So building and validating the dependencies whenever the base version changes is not quite right. It should be looked at just like any other third-party dependency. For example, a change in a third-party library doesn't and can't validate that all projects using it are unbroken. It is up to each project using it to decide when to upgrade to a newer version.
          Since at this stage each sub-project can't wait for its dependencies' changes to be released, they can work with SNAPSHOT versions (HADOOP-5107).

          Tsz Wo Nicholas Sze added a comment -

          > A similar problem happens with your proposal too, no? Let's say someone changes some functionality in common/hdfs and does not update the corresponding test case in mapreduce (e.g., org.apache.hadoop.security.authorize.TestServiceLevelAuthorization.java). This will NOT be detected till mapreduce is built. If we talk of Hudson, then this won't happen till a mapreduce build is triggered by Hudson, which will only be when some other mapreduce patch is committed.

          That's correct, but there is no cycle if we remove the dependency from hdfs to mapreduce.

          BTW, tests like TestServiceLevelAuthorization should be split into individual hdfs and mapreduce tests.
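
          For example (purely illustrative; the class below is a hypothetical stand-in, not the actual split), the HDFS half of such a test could exercise the NameNode with only MiniDFSCluster and standard hdfs APIs, so it would carry no mapreduce dependency at all:

            // Illustrative sketch only: an HDFS-side test that needs nothing from mapreduce.
            package org.apache.hadoop.hdfs.security;

            import static org.junit.Assert.assertTrue;

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.hdfs.MiniDFSCluster;
            import org.junit.Test;

            public class TestHdfsServiceAuthorization {

              @Test
              public void testClientProtocolAllowed() throws Exception {
                Configuration conf = new Configuration();
                // Enable service-level authorization for the NameNode RPC protocols.
                conf.setBoolean("hadoop.security.authorization", true);

                MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
                try {
                  FileSystem fs = cluster.getFileSystem();
                  // A client call permitted by the default ACLs should go through.
                  assertTrue(fs.mkdirs(new Path("/authz-test")));
                } finally {
                  cluster.shutdown();
                }
              }
            }

          The mapreduce-side half (job submission under service-level authorization) would keep living in the mapreduce project, so neither test drags in the other project's build.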

          Doug Cutting added a comment -

          > If not, what are the alternative suggestions?

          Perhaps we just need a higher-level build file, so that the standard working pattern is something like:

          svn co https://svn.apache.org/repos/asf/hadoop/build/trunk .
          svn co https://svn.apache.org/repos/asf/hadoop/common/trunk common
          svn co https://svn.apache.org/repos/asf/hadoop/hdfs/trunk hdfs
          svn co https://svn.apache.org/repos/asf/hadoop/mapreduce/trunk mapreduce
          ant test-hdfs
          ant test-mapreduce
          

          If you want to test against a different version, then check out its branch or tag, or unpack its tarball. The test-mapreduce and test-hdfs targets would first build both hdfs and mapreduce, then build and run tests.

          Hemanth Yamijala added a comment -

          I think the test cases in question are using map/reduce as a tool to submit jobs and are not really related to map/reduce. In that sense, it seems like HDFS is still the right place to keep these tests, no?

          dhruba borthakur added a comment -

          > If something changes in FS/HDFS, it looks very odd to go and change the corresponding tests in mapred.

          I agree with Vinod. Moreover, the change cannot happen atomically because this will require two patches, one for HDFS and then one for the unit tests in mapred.

          Vinod Kumar Vavilapalli added a comment -

          > The problem is that we will have breakage that won't be detected until HDFS is updated.

          A similar problem happens with your proposal too, no? Let's say someone changes some functionality in common/hdfs and does not update the corresponding test case in mapreduce (e.g., org.apache.hadoop.security.authorize.TestServiceLevelAuthorization.java). This will NOT be detected till mapreduce is built. If we talk of Hudson, then this won't happen till a mapreduce build is triggered by Hudson, which will only be when some other mapreduce patch is committed.

          What do others in the community think of this problem in general? Is it OK to move these hdfs/common related tests/benchmarks into mapreduce? If not, what are the alternative suggestions?

          Owen O'Malley added a comment -

          The problem is that we will have breakage that won't be detected until HDFS is updated. That would substantially increase the chance of missed problems.

          Vinod Kumar Vavilapalli added a comment -

          With the approach that I've outlined, we will not have cycles. Building mapreduce doesn't require building common/hdfs again. This will be similar to what we will do once we have HADOOP-5107 and make mapreduce work with stable/intended versions of common/hdfs instead of always working against the latest versions.

          Owen O'Malley added a comment -

          That had been the plan, but in practice, it didn't work well. To build, you needed to compile common and update the common jars in hdfs and mapred. Then you compile hdfs and push the jar to mapreduce. Then you compile mapreduce and push to hdfs. Then you compile hdfs-test and push to mapreduce. Then you compile the mapreduce tests and push them to hdfs. Then you run the hdfs tests. Then you run the mapreduce tests.

          By comparison, if we break the cycle, we can compile common, test common, compile hdfs, test hdfs, compile mapreduce, and test mapreduce. Yes, we need to do more work to test hdfs without mapreduce. But this is a good change.

          Vinod Kumar Vavilapalli added a comment -

          The currently planned move moves file-system-specific tests, which seem out of place in mapreduce to me. If something changes in FS/HDFS, it looks very odd to go and change the corresponding tests in mapred.

          Why shouldn't HDFS tests and benchmarks just run with a specified version of mapreduce instead of the latest version? This would remove the need to build mapreduce every time an HDFS build is needed. If the tests/benchmarks need something cutting-edge in mapreduce, then the corresponding mapreduce jar can be committed to HDFS at that point in time. Thoughts?

          Konstantin Shvachko added a comment -

          We will also need to move block_forensics stuff from hdfs/contrib into map-reduce if we want to straighten up project dependencies.

          Tsz Wo Nicholas Sze added a comment -

          +1

          Owen O'Malley added a comment -

          I propose moving all of the tests and benchmarks that depend on map/reduce into map/reduce. That will enable us to build in a linear order.

          common -> hdfs -> mapreduce


            People

            • Assignee: Owen O'Malley
            • Reporter: Owen O'Malley
            • Votes: 0
            • Watchers: 9
