Pig
  1. Pig
  2. PIG-2898

Parallel execution of e2e tests

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.11, 0.12.0
    • Component/s: e2e harness
    • Labels:
    • Patch Info:
      Patch Available

      Description

      Today it takes ~19 hours to run the full set of e2e tests in mapred mode. The bottleneck here is the client side, and per our observations it can help a lot if the e2e harness would be able to run tests in parallel threads.

      We prototyped changes in e2e harness allowing to run tests in a configurable number of threads. Preliminary results show more than 6x reduction in execution time when using a small 3-nodes M/R cluster with modest configuration. Going to share a patch shortly.

      1. PIG-2898-trunk-3.patch
        27 kB
        Ilya Katsov
      2. PIG-2898-trunk-8.patch
        37 kB
        Ivan A. Veselovsky

        Activity

        Hide
        Julien Le Dem added a comment -

        Hi Andrey, this sounds interesting.
        do you have a patch available?
        Julien

        Show
        Julien Le Dem added a comment - Hi Andrey, this sounds interesting. do you have a patch available? Julien
        Hide
        Ivan A. Veselovsky added a comment -

        We provided parallelized mode of the e2e tests execution using Parallel::ForkManager.
        Two parameters affect the behavior:
        1) file.fork.factor – max number of subprocesses when running test configuration files (.conf);
        2) fork.factor – max number of subprocesses when running tests within one .conf file.
        Total max number of subprocesses canot exceed the product of the 2 values.
        Value <= 1 mean no paralellizing.
        Example: ant -Dfork.factor=3 -Dfile.fork.factor=3 ... test-e2e

        The attached patch is to be applied to http://svn.apache.org/repos/asf/pig/branches/branch-0.9/ branch.

        The patch testing procedure gives the following results for the patch:
        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 24 new or modified tests.
        [exec]
        [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        [exec]

        Show
        Ivan A. Veselovsky added a comment - We provided parallelized mode of the e2e tests execution using Parallel::ForkManager. Two parameters affect the behavior: 1) file.fork.factor – max number of subprocesses when running test configuration files (.conf); 2) fork.factor – max number of subprocesses when running tests within one .conf file. Total max number of subprocesses canot exceed the product of the 2 values. Value <= 1 mean no paralellizing. Example: ant -Dfork.factor=3 -Dfile.fork.factor=3 ... test-e2e The attached patch is to be applied to http://svn.apache.org/repos/asf/pig/branches/branch-0.9/ branch. The patch testing procedure gives the following results for the patch: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 24 new or modified tests. [exec] [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec]
        Hide
        Ivan A. Veselovsky added a comment -

        the patch pig-2898-for-svn-branch-0.9.patch is attached.

        Show
        Ivan A. Veselovsky added a comment - the patch pig-2898-for-svn-branch-0.9.patch is attached.
        Hide
        Rohini Palaniswamy added a comment -

        Ivan,
        All the changes to e2e framework to get it working with H23 and benchmark caching is removed with this patch. I think this is because you started before https://issues.apache.org/jira/browse/PIG-2484 went into 0.9 branch. You will have to update the patch with those included.

        Show
        Rohini Palaniswamy added a comment - Ivan, All the changes to e2e framework to get it working with H23 and benchmark caching is removed with this patch. I think this is because you started before https://issues.apache.org/jira/browse/PIG-2484 went into 0.9 branch. You will have to update the patch with those included.
        Hide
        Ivan A. Veselovsky added a comment -

        Thanks, Rohini,
        I fixed some issues in the code and prepared the patch against "trunk". Later I will attach the patch for 0.10 also.

        Some more comments to this change:
        1) --startat (-st) test_harness.pl option is not supported in the group fork factor is greater than 1. This is because of the fact that several test groups are executed simultaneously.
        2) In a forked mode "Results so far, ..." lines on the console may show results that are less than the actual results achieved so far. This is because each subprocess prints only its own data and does not know about the progress of other subprocesses.

        Show
        Ivan A. Veselovsky added a comment - Thanks, Rohini, I fixed some issues in the code and prepared the patch against "trunk". Later I will attach the patch for 0.10 also. Some more comments to this change: 1) --startat (-st) test_harness.pl option is not supported in the group fork factor is greater than 1. This is because of the fact that several test groups are executed simultaneously. 2) In a forked mode "Results so far, ..." lines on the console may show results that are less than the actual results achieved so far. This is because each subprocess prints only its own data and does not know about the progress of other subprocesses.
        Hide
        Ivan A. Veselovsky added a comment -

        My testing shows that the same patch applies okay to "branch-0.10".

        Show
        Ivan A. Veselovsky added a comment - My testing shows that the same patch applies okay to "branch-0.10".
        Hide
        Ivan A. Veselovsky added a comment -

        Marking as Implemented since the patch is available.

        Show
        Ivan A. Veselovsky added a comment - Marking as Implemented since the patch is available.
        Hide
        Ivan A. Veselovsky added a comment -

        Also i'm attaching script xmlReport.pl with corrected calculation of the tests duration: xmlReport-fixed-duration.pl

        Show
        Ivan A. Veselovsky added a comment - Also i'm attaching script xmlReport.pl with corrected calculation of the tests duration: xmlReport-fixed-duration.pl
        Hide
        Ivan A. Veselovsky added a comment -

        Slightly improved version of the patch: it also resolves issues with collisions due to identical 'tmpPath' value in the parallel tests.

        Show
        Ivan A. Veselovsky added a comment - Slightly improved version of the patch: it also resolves issues with collisions due to identical 'tmpPath' value in the parallel tests.
        Hide
        Rohini Palaniswamy added a comment -

        Ivan,
        Re-opening the issue. Until the patch is committed, it cannot be be marked resolved. It needs to be left in patch available state.

        Show
        Rohini Palaniswamy added a comment - Ivan, Re-opening the issue. Until the patch is committed, it cannot be be marked resolved. It needs to be left in patch available state.
        Hide
        Ivan A. Veselovsky added a comment -

        "PIG-2898-fix-sub-prototypes.patch" (attached): additional patch that fixes subroutines prototypes: the original patch worked ok with perl 5.14.2, but did not work with perl 5.8.8. This patch resolves this issue. This patch is to be applied after "PIG-2898-against-trunk.patch".

        Show
        Ivan A. Veselovsky added a comment - " PIG-2898 -fix-sub-prototypes.patch" (attached): additional patch that fixes subroutines prototypes: the original patch worked ok with perl 5.14.2, but did not work with perl 5.8.8. This patch resolves this issue. This patch is to be applied after " PIG-2898 -against-trunk.patch".
        Hide
        Andrey Klochkov added a comment -

        changing the name of JIRA from "multithreaded" to "parallel" as we've done it with forks instead of threads.

        Show
        Andrey Klochkov added a comment - changing the name of JIRA from "multithreaded" to "parallel" as we've done it with forks instead of threads.
        Hide
        Ilya Katsov added a comment -

        Updating a patch with fixes for e2e local mode.

        Show
        Ilya Katsov added a comment - Updating a patch with fixes for e2e local mode.
        Hide
        Rohini Palaniswamy added a comment -

        Ilya,
        Thanks it worked for local mode. But there were some additional test failures than usual in H23. H20 tests are still running and I will update status on them tomorrow.

        Few comments:
        1) Can you change yarn.nodemanager.local-dirs to mapreduce.cluster.local.dir. yarn.nodemanager.local-dirs was a wrong suggestion from me. Tested with mapreduce.cluster.local.dir and it works.
        2) Can you change name of hadoop.mapred.dir to hadoop.mapred.local.dir as it is slightly confusing and make it configurable through commandline. In many cases /tmp gets full and would be good to have the ability to point to some other dir.
        3) I had some comments in the reviewboard. Can you incorporate them too and post an updated patch in reviewboard.

        Show
        Rohini Palaniswamy added a comment - Ilya, Thanks it worked for local mode. But there were some additional test failures than usual in H23. H20 tests are still running and I will update status on them tomorrow. Few comments: 1) Can you change yarn.nodemanager.local-dirs to mapreduce.cluster.local.dir. yarn.nodemanager.local-dirs was a wrong suggestion from me. Tested with mapreduce.cluster.local.dir and it works. 2) Can you change name of hadoop.mapred.dir to hadoop.mapred.local.dir as it is slightly confusing and make it configurable through commandline. In many cases /tmp gets full and would be good to have the ability to point to some other dir. 3) I had some comments in the reviewboard. Can you incorporate them too and post an updated patch in reviewboard.
        Hide
        Ivan A. Veselovsky added a comment -

        Hi, Rohini,
        all the mentioned suggestions were adderessed in the patch #5. This patch is cumulative: it aggregates all the changes made in previous patches.

        Notes:

        • In parallellized mode context (like "[myfile.conf-MyGroup]" is printed after the results due to formatting issues (some contexts are too long).
        • In trunk branch test streaming_local.conf/StreamingLocal_11 hangs in local mode (observed in both sequential and parallel execution modes). So, I recommend to comment it out to get full results.
        • The local dir parametrized with 'hadoop.mapred.local.dir' in ant, or 'HADOOP_MAPRED_LOCAL_DIR' in environment.
        • Debug output parametrized with 'e2e.debug' in ant, or 'E2E_DEBUG' in environment.
        Show
        Ivan A. Veselovsky added a comment - Hi, Rohini, all the mentioned suggestions were adderessed in the patch #5. This patch is cumulative: it aggregates all the changes made in previous patches. Notes: In parallellized mode context (like " [myfile.conf-MyGroup] " is printed after the results due to formatting issues (some contexts are too long). In trunk branch test streaming_local.conf/StreamingLocal_11 hangs in local mode (observed in both sequential and parallel execution modes). So, I recommend to comment it out to get full results. The local dir parametrized with 'hadoop.mapred.local.dir' in ant, or 'HADOOP_MAPRED_LOCAL_DIR' in environment. Debug output parametrized with 'e2e.debug' in ant, or 'E2E_DEBUG' in environment.
        Hide
        Ivan A. Veselovsky added a comment -

        Please see also the comments on the review board: https://reviews.apache.org/r/7053/ – the patch #5 uploaded there.

        Show
        Ivan A. Veselovsky added a comment - Please see also the comments on the review board: https://reviews.apache.org/r/7053/ – the patch #5 uploaded there.
        Hide
        Ivan A. Veselovsky added a comment -

        Performance measurement results with patch #5 in e2e local mode:
        1) the test results are the same;
        2) parallelized local mode ran 2.7 times faster than the sequential one (250 minutes vs. 91 minute).

        Show
        Ivan A. Veselovsky added a comment - Performance measurement results with patch #5 in e2e local mode: 1) the test results are the same; 2) parallelized local mode ran 2.7 times faster than the sequential one (250 minutes vs. 91 minute).
        Hide
        Ivan A. Veselovsky added a comment -

        Attached reviewed and tested versions of the patches (6-final): against "trunk" and "branch-0.10" respectively.

        Show
        Ivan A. Veselovsky added a comment - Attached reviewed and tested versions of the patches (6-final): against "trunk" and "branch-0.10" respectively.
        Hide
        Ivan A. Veselovsky added a comment -

        cancelling the prev. patch to test the last one.

        Show
        Ivan A. Veselovsky added a comment - cancelling the prev. patch to test the last one.
        Hide
        Ivan A. Veselovsky added a comment -

        testing the latest patch (-6-final) against trunk.

        Show
        Ivan A. Veselovsky added a comment - testing the latest patch (-6-final) against trunk.
        Hide
        Rohini Palaniswamy added a comment -

        +1 nonbinding.

        https://cwiki.apache.org/confluence/display/PIG/HowToTest#HowToTest-HowtoRune2eTests needs to be updated with the following information after the patch goes in.

        • the options fork.factor.conf.file (number of parallel forks for the test files) and fork.factor.group (number of forks for groups within a test file)
        • Parallel::Forkmanager perl module installation
        Show
        Rohini Palaniswamy added a comment - +1 nonbinding. https://cwiki.apache.org/confluence/display/PIG/HowToTest#HowToTest-HowtoRune2eTests needs to be updated with the following information after the patch goes in. the options fork.factor.conf.file (number of parallel forks for the test files) and fork.factor.group (number of forks for groups within a test file) Parallel::Forkmanager perl module installation
        Hide
        Daniel Dai added a comment -

        One thing I can see is the initialization cost is high when running in parallel mode. Test try to create a temp directory for every section even we only run a bunch of tests. I wonder if it is easy to do to create temp directory on demand rather than upfront?

        Show
        Daniel Dai added a comment - One thing I can see is the initialization cost is high when running in parallel mode. Test try to create a temp directory for every section even we only run a bunch of tests. I wonder if it is easy to do to create temp directory on demand rather than upfront?
        Hide
        Daniel Dai added a comment -

        One other question, is the test log in sequence in the case of parallel execution?

        Show
        Daniel Dai added a comment - One other question, is the test log in sequence in the case of parallel execution?
        Hide
        Rohini Palaniswamy added a comment -

        There is a logfile created per test group. Once the execution of test group completes, the log is appended to the main log file.

        Show
        Rohini Palaniswamy added a comment - There is a logfile created per test group. Once the execution of test group completes, the log is appended to the main log file.
        Hide
        Ivan A. Veselovsky added a comment -

        Hi, Daniel, Rohini,
        I implemented the required optimization which ensures that the local and HDFS directories are created only when needed (on demand).
        These changes are in newly attached "PIG-2898-trunk-7.patch".

        The idea of the fix is that we splitted methods #globalSetup() and #globalCleanup() into 2 parts: new methods #globalSetup2() and #globalClenup2() methods introduced. The method #globalSetup2() only invoked if there is some test to execute, and #globalCleanup2() is only invoked if #globalSetup2() was invoked.

        Also I in this patch I reverted one of previous changes that changed IPC::Run::run('mkdir' ...) to "mkpath" perl call because "mkpath" appears to have (at lest on my perl implementation 5.14.2) quite strange feature: it returns non-zero exit status with "No such file or directory" message if the directory we're attempting to create already exists. This behavior is unexpected and confusing because it contradicts to native "mkdir -p" and java.io.File#mkdirs() behavior. So, despite of the fact that IPC::Run::run is slower, I prefer to use it to avoid developer's trouble.

        Show
        Ivan A. Veselovsky added a comment - Hi, Daniel, Rohini, I implemented the required optimization which ensures that the local and HDFS directories are created only when needed (on demand). These changes are in newly attached " PIG-2898 -trunk-7.patch". The idea of the fix is that we splitted methods #globalSetup() and #globalCleanup() into 2 parts: new methods #globalSetup2() and #globalClenup2() methods introduced. The method #globalSetup2() only invoked if there is some test to execute, and #globalCleanup2() is only invoked if #globalSetup2() was invoked. Also I in this patch I reverted one of previous changes that changed IPC::Run::run('mkdir' ...) to "mkpath" perl call because "mkpath" appears to have (at lest on my perl implementation 5.14.2) quite strange feature: it returns non-zero exit status with "No such file or directory" message if the directory we're attempting to create already exists. This behavior is unexpected and confusing because it contradicts to native "mkdir -p" and java.io.File#mkdirs() behavior. So, despite of the fact that IPC::Run::run is slower, I prefer to use it to avoid developer's trouble.
        Hide
        Ivan A. Veselovsky added a comment -

        I'm also attaching new version of the patch patch agains "branch-0.10": PIG-2898-branch-0.10-7.patch.

        The new version of path agaings trunk is also uploaded to the review board: https://reviews.apache.org/r/7053/

        Show
        Ivan A. Veselovsky added a comment - I'm also attaching new version of the patch patch agains "branch-0.10": PIG-2898 -branch-0.10-7.patch. The new version of path agaings trunk is also uploaded to the review board: https://reviews.apache.org/r/7053/
        Hide
        Rohini Palaniswamy added a comment -

        +1. Thanks Ivan. Tested the new patch. It is now faster even for running two or three tests. It would have been nicer if you had named the new methods something more meaningful like onGroupRunSetup()/onGroupRunCleanup() instead of globalSetup2() or cleanupSetup2(). But I guess it is ok since it is only the e2e test script.

        Show
        Rohini Palaniswamy added a comment - +1. Thanks Ivan. Tested the new patch. It is now faster even for running two or three tests. It would have been nicer if you had named the new methods something more meaningful like onGroupRunSetup()/onGroupRunCleanup() instead of globalSetup2() or cleanupSetup2(). But I guess it is ok since it is only the e2e test script.
        Hide
        Daniel Dai added a comment -

        +1. Rohini, can you commit once you get permission?

        Show
        Daniel Dai added a comment - +1. Rohini, can you commit once you get permission?
        Hide
        Ivan A. Veselovsky added a comment -

        Hi, Rohini,
        in the attached "PIG-2898-trunk-8.patch" I renamed the methods globalSetup2()/globalCleanup2() to globalSetupConditional()/globalCleanupConditional() respectively.
        This is different from the names onGroupRunXxx() suggested by you because of 2 reasons:
        1) globalSetup2() executed once per test config file in sequential mode, and once per test group in parallel mode, so, "onGroupRun" is not quite exact.
        2) globalXxx2() methods complement the globalXxx() methods (are their conditional parts), so, their names should be similar to each other.

        Show
        Ivan A. Veselovsky added a comment - Hi, Rohini, in the attached " PIG-2898 -trunk-8.patch" I renamed the methods globalSetup2()/globalCleanup2() to globalSetupConditional()/globalCleanupConditional() respectively. This is different from the names onGroupRunXxx() suggested by you because of 2 reasons: 1) globalSetup2() executed once per test config file in sequential mode, and once per test group in parallel mode, so, "onGroupRun" is not quite exact. 2) globalXxx2() methods complement the globalXxx() methods (are their conditional parts), so, their names should be similar to each other.
        Hide
        Rohini Palaniswamy added a comment -

        Thanks Ivan. Will commit soon. Just FYI. It is not required to cleanup after by deleting older patches. We usually leave them in the jira so that the history of the progress is there.

        Show
        Rohini Palaniswamy added a comment - Thanks Ivan. Will commit soon. Just FYI. It is not required to cleanup after by deleting older patches. We usually leave them in the jira so that the history of the progress is there.
        Hide
        Rohini Palaniswamy added a comment -

        Patch committed to 0.11 and trunk. Thanks Ivan.

        Show
        Rohini Palaniswamy added a comment - Patch committed to 0.11 and trunk. Thanks Ivan.

          People

          • Assignee:
            Ivan A. Veselovsky
            Reporter:
            Andrey Klochkov
          • Votes:
            1 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development