Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1408

Allow customization of job submission policies

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: contrib/gridmix
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently, gridmix3 replay job submission faithfully. For evaluation purposes, it would be great if we can support other job submission policies such as sequential job submission, or stress job submission.

      1. 1408-1.patch
        66 kB
        rahul k singh
      2. 1408-2.patch
        69 kB
        rahul k singh
      3. 1408-2.patch
        69 kB
        rahul k singh
      4. 1408-20.patch
        70 kB
        rahul k singh
      5. 1408-20-2.patch
        79 kB
        rahul k singh
      6. 1408-20-3.patch
        78 kB
        rahul k singh
      7. 1408-20-4.patch
        78 kB
        rahul k singh
      8. 1408-3.patch
        75 kB
        rahul k singh
      9. 1408-4.patch
        74 kB
        rahul k singh
      10. 1408-5.patch
        74 kB
        rahul k singh

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #250 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/250/)

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #250 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/250/ )
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #262 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/262/)
          . Add customizable job submission policies to Gridmix. Contributed by Rahul Singh

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #262 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/262/ ) . Add customizable job submission policies to Gridmix. Contributed by Rahul Singh
          Hide
          Chris Douglas added a comment -

          This doesn't need to go through Hudson again; the only changes were to constants and the relevant test case passes.

          +1

          I committed this. Thanks, Rahul!

          Show
          Chris Douglas added a comment - This doesn't need to go through Hudson again; the only changes were to constants and the relevant test case passes. +1 I committed this. Thanks, Rahul!
          Hide
          rahul k singh added a comment -

          A very minute change in DebugJobProducer

          Show
          rahul k singh added a comment - A very minute change in DebugJobProducer
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12437859/1408-4.patch
          against trunk revision 918864.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/18/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/18/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/18/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/18/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12437859/1408-4.patch against trunk revision 918864. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/18/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/18/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/18/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/18/console This message is automatically generated.
          Hide
          rahul k singh added a comment -

          running against hudson

          Show
          rahul k singh added a comment - running against hudson
          Hide
          rahul k singh added a comment -

          added some offline comments by chris:

          JobFactory::FilterJobStory should remain a static class
          JobFactory::createReaderThread should be abstract, rather than returning an instance of Thread
          Initializing the polling interval as a final param- as the name is- in GridmixJobSubmissionPolicy would be more canonical

          Show
          rahul k singh added a comment - added some offline comments by chris: JobFactory::FilterJobStory should remain a static class JobFactory::createReaderThread should be abstract, rather than returning an instance of Thread Initializing the polling interval as a final param- as the name is- in GridmixJobSubmissionPolicy would be more canonical
          Hide
          rahul k singh added a comment -

          yahoo hadoop distribution patch

          Show
          rahul k singh added a comment - yahoo hadoop distribution patch
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12436445/1408-3.patch
          against trunk revision 915223.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/477/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/477/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/477/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/477/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12436445/1408-3.patch against trunk revision 915223. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 2 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/477/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/477/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/477/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/477/console This message is automatically generated.
          Hide
          rahul k singh added a comment -

          running against hudson

          Show
          rahul k singh added a comment - running against hudson
          Hide
          rahul k singh added a comment -

          Uploading the new patch , it has some changes forward ported , which were done to yahoo hadoop 20 branch to make it working on clusters.

          Show
          rahul k singh added a comment - Uploading the new patch , it has some changes forward ported , which were done to yahoo hadoop 20 branch to make it working on clusters.
          Hide
          rahul k singh added a comment -

          This is yahoo hadoop distribution patch , i ran yahoo distribution with this patch for around 12 - 13 hours. esp STRESS and SERIAL.

          Show
          rahul k singh added a comment - This is yahoo hadoop distribution patch , i ran yahoo distribution with this patch for around 12 - 13 hours. esp STRESS and SERIAL.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12435968/1408-2.patch
          against trunk revision 910465.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/324/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/324/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/324/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/324/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12435968/1408-2.patch against trunk revision 910465. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 3 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/324/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/324/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/324/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/324/console This message is automatically generated.
          Hide
          rahul k singh added a comment -

          reuploading the same patch , and trying it with hudson

          Show
          rahul k singh added a comment - reuploading the same patch , and trying it with hudson
          Hide
          rahul k singh added a comment -

          patch for yahoo hadoop 20 branch

          Show
          rahul k singh added a comment - patch for yahoo hadoop 20 branch
          Hide
          rahul k singh added a comment -

          Implemented all the points mentioned above except:
          The JobMonitor::process method can probably be removed, since it doesn't do any meaningful work (the instance is instead passed to the downstream StatsCollector, which does). The monitor is then only responsible for determining when the job is done, not its outcome The unit test can subclass the downstream component instead of subclassing JobMonitor

          The above is not done , due to time constraints and also extend Statistics is taking time. Will open the new jira where we will remove the process method.

          For the question:
          If a job fails submission in serial mode, wil the JobFactory be woken up?
          yes

          Show
          rahul k singh added a comment - Implemented all the points mentioned above except: The JobMonitor::process method can probably be removed, since it doesn't do any meaningful work (the instance is instead passed to the downstream StatsCollector, which does). The monitor is then only responsible for determining when the job is done, not its outcome The unit test can subclass the downstream component instead of subclassing JobMonitor The above is not done , due to time constraints and also extend Statistics is taking time. Will open the new jira where we will remove the process method. For the question: If a job fails submission in serial mode, wil the JobFactory be woken up? yes
          Hide
          Chris Douglas added a comment -
          • All the fields in StatsCollector can be final
          • Jobs pushed in StatsCollector::add do not need to be retained in a list, since now only its size is used. For the current purpose, it is sufficient to keep only a counter
          • Instead of fixing MAX_JOBS_COMPLETED_IN_POLL_INTERVAL to 1, it would make sense to let this parameter be configurable. SERIAL submission should set it to 1 but STRESS can be less aggressive by default. It would make sense to let this be defined by the enum, as the polling interval is in the current patch
          • Excluding SERIAL where the max jobs per interval is implicit, letting the STRESS and REPLAY define defaults for configurable parameters rather than fixed values may make sense (for both max jobs/interval and the interval period)
          • The default of 1000ms for STRESS is too low
          • Setting the interrupt flag is unnecessary if that is the condition for shutting down the thread- as in StatsCollector.Collector::run and JobFactory::run. The only reason to reset the flag is if the current context does not know how to handle the interrupt, but its caller might and it cannot throw InterruptedException. Returning from run is correct in this case.
          • The StatListener interface, the new JobFactory subclasses, and StatsCollector should be package-private. If JobFactory is made an abstract base, it may make sense to make it public
          • StatsCollector::join should respect the timeout provided, not wait forever
          • This block in StressReaderThread::run:
            +          synchronized (lock) {
            +            while (loadStatus.isOverloaded) {
            +              //Wait while JT is overloaded.
            +              try {
            +                lock.wait();
            +              } catch (InterruptedException ie) {
            +                Thread.currentThread().interrupt();
            +              }
            +            }
            +          }
            

            Does not seem correct. If interrupted while waiting, InterruptedException will be thrown, caught, the flag reset, causing another exception to be thrown. If interrupted, the thread should exit. A similar pattern is in StatsCollector.Collector::run

          • As with SerialJobFactory, StressJobFactory should be submitting jobs to be run immediately, not at the interval recorded in the trace. The submission time in the trace can be disregarded entirely.
          • StatsCollector::addClusterStatsObservers has no callers; how does one subscribe to the statistics updates from the JT?
          • LoadCalculator::checkLoadAndGetSlotsAvailable should use commons logging at debug level, rather than System.out. Similarly, many of these statements are non-trivial to construct and should be guarded by LOG.isDebugEnabled()
          • Log message such as: LOG.error(ie.getCause(), ie) Should instead include a message and the exception (which will also include its cause)
          • The LoadCalculator methods can probably be part of StressJobFactory rather than static methods in a separate class
          Show
          Chris Douglas added a comment - All the fields in StatsCollector can be final Jobs pushed in StatsCollector::add do not need to be retained in a list, since now only its size is used. For the current purpose, it is sufficient to keep only a counter Instead of fixing MAX_JOBS_COMPLETED_IN_POLL_INTERVAL to 1, it would make sense to let this parameter be configurable. SERIAL submission should set it to 1 but STRESS can be less aggressive by default. It would make sense to let this be defined by the enum, as the polling interval is in the current patch Excluding SERIAL where the max jobs per interval is implicit, letting the STRESS and REPLAY define defaults for configurable parameters rather than fixed values may make sense (for both max jobs/interval and the interval period) The default of 1000ms for STRESS is too low Setting the interrupt flag is unnecessary if that is the condition for shutting down the thread- as in StatsCollector.Collector::run and JobFactory::run . The only reason to reset the flag is if the current context does not know how to handle the interrupt, but its caller might and it cannot throw InterruptedException . Returning from run is correct in this case. The StatListener interface, the new JobFactory subclasses, and StatsCollector should be package-private. If JobFactory is made an abstract base, it may make sense to make it public StatsCollector::join should respect the timeout provided, not wait forever This block in StressReaderThread::run : + synchronized (lock) { + while (loadStatus.isOverloaded) { + //Wait while JT is overloaded. + try { + lock.wait(); + } catch (InterruptedException ie) { + Thread.currentThread().interrupt(); + } + } + } Does not seem correct. If interrupted while waiting, InterruptedException will be thrown, caught, the flag reset, causing another exception to be thrown. If interrupted, the thread should exit. A similar pattern is in StatsCollector.Collector::run As with SerialJobFactory , StressJobFactory should be submitting jobs to be run immediately, not at the interval recorded in the trace. The submission time in the trace can be disregarded entirely. StatsCollector::addClusterStatsObservers has no callers; how does one subscribe to the statistics updates from the JT? LoadCalculator::checkLoadAndGetSlotsAvailable should use commons logging at debug level, rather than System.out . Similarly, many of these statements are non-trivial to construct and should be guarded by LOG.isDebugEnabled() Log message such as: LOG.error(ie.getCause(), ie) Should instead include a message and the exception (which will also include its cause) The LoadCalculator methods can probably be part of StressJobFactory rather than static methods in a separate class
          Hide
          Chris Douglas added a comment -

          The patch looks good. A few notes after a first pass (I have not examined the statistics code, yet):

          • TestGridmixSubmission is an expensive test to run; the cluster startup/shutdown can be amortized across runs (as can the data generation). If this needs to clean up between jobs, consider using @Before and @After instead of making separate tests (i.e., no subclassing the unit test). The current patch also makes many fields protected without referencing them in subclasses and also makes static, inner classes into member classes for reasons that aren't obvious (e.g. FilterJobStory)
          • Very minor nits: the coding standards recommend whitespace after commas and before and after braces. Control statements also must enclose their blocks in braces e.g. "{{if (expr) { stmt }

            }}", never "if (expr) stmt", "foo(a, b, c) {" never "foo(a, b,c){". Some lines in the patch exceed 80 characters

          • DebugGridmix::createJobMonitor should allow IOExceptions to escape and fail instead of logging the exception
          • statistics is a more descriptive name than collector for the StatsCollector references
          • It would probably be cleaner to pull the REPLAY mode into a separate class, allowing JobFactory to be an abstract base.
          • The JobMonitor::process method can probably be removed, since it doesn't do any meaningful work (the instance is instead passed to the downstream StatsCollector, which does). The monitor is then only responsible for determining when the job is done, not its outcome The unit test can subclass the downstream component instead of subclassing JobMonitor
          • Instead of making JobFactory::rThread protected, adding an abstract createReaderThead would also allow it to remain final
          • The SerialJobFactory includes logic that preserves submission times, but it should be submitting the job so that it executes immediately. Since the queue between the JobFactory and JobSubmitter will never have more than 1 element in it, it should be sufficient to submit new jobs at the system time.
          • The SerialJobFactory should either be waiting on a condition or add new jobs to the submitter while holding the lock. Otherwise it's possible (though extremely unlikely) for signals to be missed.
          • If a job fails submission in serial mode, wil the JobFactory be woken up?
          • Instead of adding GridmixJobSubmissionPolicy::getPolicy, callers can use Configuration::getEnum (so callers may specify the default)
          • The poll interval in the submission policy enum should be specified as a final param on the instance (as the name is), not by an abstract method. The getName method is the same as its toString, isn't it? Is the "name" important?
          Show
          Chris Douglas added a comment - The patch looks good. A few notes after a first pass (I have not examined the statistics code, yet): TestGridmixSubmission is an expensive test to run; the cluster startup/shutdown can be amortized across runs (as can the data generation). If this needs to clean up between jobs, consider using @Before and @After instead of making separate tests (i.e., no subclassing the unit test). The current patch also makes many fields protected without referencing them in subclasses and also makes static, inner classes into member classes for reasons that aren't obvious (e.g. FilterJobStory ) Very minor nits: the coding standards recommend whitespace after commas and before and after braces. Control statements also must enclose their blocks in braces e.g. "{{if (expr) { stmt } }}", never " if (expr) stmt ", " foo(a, b, c) { " never " foo(a, b,c){ ". Some lines in the patch exceed 80 characters DebugGridmix::createJobMonitor should allow IOExceptions to escape and fail instead of logging the exception statistics is a more descriptive name than collector for the StatsCollector references It would probably be cleaner to pull the REPLAY mode into a separate class, allowing JobFactory to be an abstract base. The JobMonitor::process method can probably be removed, since it doesn't do any meaningful work (the instance is instead passed to the downstream StatsCollector , which does). The monitor is then only responsible for determining when the job is done, not its outcome The unit test can subclass the downstream component instead of subclassing JobMonitor Instead of making JobFactory::rThread protected, adding an abstract createReaderThead would also allow it to remain final The SerialJobFactory includes logic that preserves submission times, but it should be submitting the job so that it executes immediately. Since the queue between the JobFactory and JobSubmitter will never have more than 1 element in it, it should be sufficient to submit new jobs at the system time. The SerialJobFactory should either be waiting on a condition or add new jobs to the submitter while holding the lock. Otherwise it's possible (though extremely unlikely) for signals to be missed. If a job fails submission in serial mode, wil the JobFactory be woken up? Instead of adding GridmixJobSubmissionPolicy::getPolicy , callers can use Configuration::getEnum (so callers may specify the default) The poll interval in the submission policy enum should be specified as a final param on the instance (as the name is), not by an abstract method. The getName method is the same as its toString , isn't it? Is the "name" important?
          Hide
          rahul k singh added a comment -

          First cut of the feature , there are some issues with new TestCases , working on it , will rectify the issues and will upload the new patch . Essentially core logic and design is implemented in this patch

          Show
          rahul k singh added a comment - First cut of the feature , there are some issues with new TestCases , working on it , will rectify the issues and will upload the new patch . Essentially core logic and design is implemented in this patch

            People

            • Assignee:
              rahul k singh
              Reporter:
              rahul k singh
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development