Hadoop Map/Reduce
MAPREDUCE-518

Have end-to-end tests based on MiniMRCluster to verify correct behaviour of slot reclamation by queues.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      We should have a test that submits long-running jobs to different queues one after the other and verifies that, within the specified amount of time, each queue either gets its required capacity or, once tasks have been killed, gets back the capacity that was taken away from it.
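
      A test of this kind needs a way to wait for a queue to reach a given capacity without waiting forever. The following is a minimal sketch of such a bounded wait, not part of Hadoop or of MiniMRCluster: the class name CapacityWaitSketch, the waitFor helper, and the stand-in condition in main are all hypothetical, and a real test would pass a condition that inspects the queue's running tasks.

```java
import java.util.function.BooleanSupplier;

public class CapacityWaitSketch {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Hypothetical helper: returns true if the condition became true in
     * time, false on timeout or if the waiting thread was interrupted.
     */
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // capacity was not reached within the limit
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "queue has its capacity": true after about 200 ms.
        boolean reached = waitFor(
                () -> System.currentTimeMillis() - start >= 200, 2000, 20);
        System.out.println("capacity reached in time: " + reached);
    }
}
```

      The timeout passed to such a helper would be derived from the reclaim time limit under test, so that a failure points at reclamation rather than at an arbitrary sleep.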


          Activity

          Allen Wittenauer added a comment -

          Stale

          Vinod Kumar Vavilapalli added a comment -

          While testing, I came across the following problem with the ReclaimCapacity functionality.

          • When the reclaim-capacity interval is sufficiently small (1 or 2 seconds; the default is 5), I see many of the following exceptions in the log. The exception is fatal for the iteration of the reclaim-capacity functionality in which it occurs. The cause is that TaskStatus is populated only when a TaskTracker (TT) reports back that it has launched the task, and we don't have null checks for TaskStatus in TaskSchedulingMgr.killTasksFromQueue. With a larger reclaim interval the problem is not visible, because the TTs report back within that time and TaskStatus is never observed to be null.
             09/01/21 12:14:35 ERROR mapred.CapacityTaskScheduler: Error in redistributing capacity:
             java.lang.NullPointerException
                  at java.util.TreeMap.getEntry(TreeMap.java:341)
                  at java.util.TreeMap.get(TreeMap.java:272)
                  at org.apache.hadoop.mapred.TaskInProgress.killTask(TaskInProgress.java:741)
                  at org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.killTasksFromJob(CapacityTaskScheduler.java:878)
                  at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.killTasksFromQueue(CapacityTaskScheduler.java:612)
                  at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.killTasks(CapacityTaskScheduler.java:594)
                  at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.reclaimCapacity(CapacityTaskScheduler.java:531)
                  at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$800(CapacityTaskScheduler.java:362)
                  at org.apache.hadoop.mapred.CapacityTaskScheduler.reclaimCapacity(CapacityTaskScheduler.java:1216)
                  at org.apache.hadoop.mapred.CapacityTaskScheduler$ReclaimCapacity.run(CapacityTaskScheduler.java:1001)
                  at java.lang.Thread.run(Thread.java:636)
             

          Inserting null checks prevents this.
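
          As an illustration of the kind of guard meant here, the following is a minimal, self-contained sketch and not the actual CapacityTaskScheduler code. The class ReclaimNullCheckSketch, the Status type, and the killableAttempts method are hypothetical stand-ins; the sketch only assumes that a task's status can still be null when the reclaim pass runs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReclaimNullCheckSketch {

    /** Hypothetical stand-in for a status reported by a TaskTracker. */
    static final class Status {
        final String attemptId;

        Status(String attemptId) {
            this.attemptId = attemptId;
        }
    }

    /**
     * Collects the attempts that can be killed in this reclaim pass.
     * Entries whose status is still null (no report from the tracker yet)
     * are skipped instead of being dereferenced.
     */
    static List<String> killableAttempts(Map<String, Status> statuses) {
        List<String> killable = new ArrayList<>();
        for (Map.Entry<String, Status> entry : statuses.entrySet()) {
            Status status = entry.getValue();
            if (status == null) {
                continue; // not reported yet; a later pass can pick it up
            }
            killable.add(status.attemptId);
        }
        return killable;
    }

    public static void main(String[] args) {
        Map<String, Status> statuses = new LinkedHashMap<>();
        statuses.put("attempt_0", new Status("attempt_0"));
        statuses.put("attempt_1", null); // scheduled, but no report yet
        System.out.println(killableAttempts(statuses));
    }
}
```

          Skipping an unreported task, rather than failing, lets the rest of the reclaim iteration proceed; the skipped task can be considered again on the next pass.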


            People

            • Assignee: Vinod Kumar Vavilapalli
            • Reporter: Vinod Kumar Vavilapalli
            • Votes: 0
            • Watchers: 3
