Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1526

Cache the job related information while submitting the job , this would avoid many RPC calls to JobTracker.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: contrib/gridmix
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    1. 1526-yahadoop-20-101.patch
      28 kB
      rahul k singh
    2. 1526-yahadoop-20-101-2.patch
      30 kB
      rahul k singh
    3. 1526-yahadoop-20-101-3.patch
      30 kB
      rahul k singh
    4. ASF.LICENSE.NOT.GRANTED--1526-yhadoop-20-101-4.patch
      28 kB
      rahul k singh
    5. ASF.LICENSE.NOT.GRANTED--1526-yhadoop-20-101-4.patch
      28 kB
      rahul k singh

      Issue Links

        Activity

        Hide
        Chris Douglas added a comment -

        Fixed in MAPREDUCE-1840

        Show
        Chris Douglas added a comment - Fixed in MAPREDUCE-1840
        Hide
        rahul k singh added a comment -

        Added the new patch . Following comments are done in this.:

        • Restore the behavior of default seq == -1 for GridmixJob and
          GenerateData
        • why do we need this? wouldn't noOfRunningJobs always be the same as
          jobMaps.size()?
          +  private static AtomicInteger noOfRunningJobs= new AtomicInteger(0);
          
        • The proper logic for addJobStats should be: first check if seq < 0,
          if yes, ignore the job; then if jobdesc is null, we should throw
          exception instead of adjust the #maps == 1.
        • The implementation of Statistics.add(Job job) is still wrong: You
          should hold the return value of jobMaps.remove(), and call
          StatListener<JobStats>.update() with the return value iff the return
          value is not null.
        • We should eliminate the variable runningJobs in StressJobFactory?

        Minor things:

        • The following comments from my previous review were not addressed:
          > - I think the following statement should be Log.debug() instead of
          > Log.info() (and be protected by a check of LOG.isDebugEnabled()):
          >
          > -    if (LOG.isDebugEnabled()) {
          > -      LOG.info(
          > +    LOG.info(
          >         System.currentTimeMillis() + " Overloaded is " + 
          > Boolean.toString(
          >           overloaded) + " incompleteMapTasks " + relOp + " " +
          >           OVERLAOD_MAPTASK_MAPSLOT_RATIO + "*mapSlotCapacity" + "(" +
          >           incompleteMapTasks + " " + relOp + " " +
          >           OVERLAOD_MAPTASK_MAPSLOT_RATIO + "*" +
          >           clusterStatus.getMaxMapTasks() + ")");
          > -    }
          > +
          > 
        • static List<InputSplit> pullDescription(JobContext jobCtxt) can be
          implemented on top of GridmixJob.getJobSeqId
        • removed the redundant GridmixJob.getJobSeqId() calls.
        • fixed a minor bug in Statistics.addJobStats(Job, JobStats)
        Show
        rahul k singh added a comment - Added the new patch . Following comments are done in this.: Restore the behavior of default seq == -1 for GridmixJob and GenerateData why do we need this? wouldn't noOfRunningJobs always be the same as jobMaps.size()? + private static AtomicInteger noOfRunningJobs= new AtomicInteger(0); The proper logic for addJobStats should be: first check if seq < 0, if yes, ignore the job; then if jobdesc is null, we should throw exception instead of adjust the #maps == 1. The implementation of Statistics.add(Job job) is still wrong: You should hold the return value of jobMaps.remove(), and call StatListener<JobStats>.update() with the return value iff the return value is not null. We should eliminate the variable runningJobs in StressJobFactory? Minor things: The following comments from my previous review were not addressed: > - I think the following statement should be Log.debug() instead of > Log.info() (and be protected by a check of LOG.isDebugEnabled()): > > - if (LOG.isDebugEnabled()) { > - LOG.info( > + LOG.info( > System.currentTimeMillis() + " Overloaded is " + > Boolean.toString( > overloaded) + " incompleteMapTasks " + relOp + " " + > OVERLAOD_MAPTASK_MAPSLOT_RATIO + "*mapSlotCapacity" + "(" + > incompleteMapTasks + " " + relOp + " " + > OVERLAOD_MAPTASK_MAPSLOT_RATIO + "*" + > clusterStatus.getMaxMapTasks() + ")"); > - } > + > static List<InputSplit> pullDescription(JobContext jobCtxt) can be implemented on top of GridmixJob.getJobSeqId removed the redundant GridmixJob.getJobSeqId() calls. fixed a minor bug in Statistics.addJobStats(Job, JobStats)
        Hide
        rahul k singh added a comment -

        Small unused import and rand were . Rectified that issue

        Show
        rahul k singh added a comment - Small unused import and rand were . Rectified that issue
        Hide
        rahul k singh added a comment -

        Adding the new patch.

        Show
        rahul k singh added a comment - Adding the new patch.
        Hide
        Hong Tang added a comment -

        Mostly good. Detailed comments:

        Statistics.java:

        • Use job seq id instead of ORIGNAME+random number as key to track the job stats.
        • JobStats.setNoOfMaps should not be public. Better if you have a JobStats constructor that takes two parameters, and eliminate the need of the setters or the empty constructor.
        • You should avoid returning the whole jobMaps in ClusterStats, it became worse when you save the returned map in the member field of StressJobFactory. It seems sufficient to avoid them if you have the following methods in ClusterStatus
          • int getNumRunningJobs()
          • Collection<JobStats> getRunningJobStats()
        • The following code looks wrong. Why is an empty JobStats object created? Shouldn't you search from internal JobStats map, and only call listeners if an instance of JobStats is found? Also, the if statement seems redundant.
                try {
                  //Job is completed notify all the listeners.
                  if (jobStatListeners.size() > 0) {
                    for (StatListener<JobStats> l : jobStatListeners) {
                      JobStats stats = new JobStats();
                      l.update(stats);
                    }
                  }
          

        StressJobFactory.java:

        • Why do we need to add "volatile" to loadStatus?
        • I think the following statement should be Log.debug() instead of Log.info() (and be protected by a check of LOG.isDebugEnabled()):
          -    if (LOG.isDebugEnabled()) {
          -      LOG.info(
          +    LOG.info(
                   System.currentTimeMillis() + " Overloaded is " + Boolean.toString(
                     overloaded) + " incompleteMapTasks " + relOp + " " +
                     OVERLAOD_MAPTASK_MAPSLOT_RATIO + "*mapSlotCapacity" + "(" +
                     incompleteMapTasks + " " + relOp + " " +
                     OVERLAOD_MAPTASK_MAPSLOT_RATIO + "*" +
                     clusterStatus.getMaxMapTasks() + ")");
          -    }
          +
          

        Misc:

        • Some indentation problem in JobSubmitter.java
        Show
        Hong Tang added a comment - Mostly good. Detailed comments: Statistics.java: Use job seq id instead of ORIGNAME+random number as key to track the job stats. JobStats.setNoOfMaps should not be public. Better if you have a JobStats constructor that takes two parameters, and eliminate the need of the setters or the empty constructor. You should avoid returning the whole jobMaps in ClusterStats, it became worse when you save the returned map in the member field of StressJobFactory. It seems sufficient to avoid them if you have the following methods in ClusterStatus int getNumRunningJobs() Collection<JobStats> getRunningJobStats() The following code looks wrong. Why is an empty JobStats object created? Shouldn't you search from internal JobStats map, and only call listeners if an instance of JobStats is found? Also, the if statement seems redundant. try { //Job is completed notify all the listeners. if (jobStatListeners.size() > 0) { for (StatListener<JobStats> l : jobStatListeners) { JobStats stats = new JobStats(); l.update(stats); } } StressJobFactory.java: Why do we need to add "volatile" to loadStatus? I think the following statement should be Log.debug() instead of Log.info() (and be protected by a check of LOG.isDebugEnabled()): - if (LOG.isDebugEnabled()) { - LOG.info( + LOG.info( System.currentTimeMillis() + " Overloaded is " + Boolean.toString( overloaded) + " incompleteMapTasks " + relOp + " " + OVERLAOD_MAPTASK_MAPSLOT_RATIO + "*mapSlotCapacity" + "(" + incompleteMapTasks + " " + relOp + " " + OVERLAOD_MAPTASK_MAPSLOT_RATIO + "*" + clusterStatus.getMaxMapTasks() + ")"); - } + Misc: Some indentation problem in JobSubmitter.java
        Hide
        rahul k singh added a comment -

        Attaching the first cut of the feature.

        As Job.getJobId returns null in yhadoop 20.1xx . We are using GridmixJob.ORIGNAME as the configuration variable to set the key for job. GridmixJob class the related code.

        Show
        rahul k singh added a comment - Attaching the first cut of the feature. As Job.getJobId returns null in yhadoop 20.1xx . We are using GridmixJob.ORIGNAME as the configuration variable to set the key for job. GridmixJob class the related code.

          People

          • Assignee:
            rahul k singh
            Reporter:
            rahul k singh
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development