Hive / HIVE-23792

[LLAP] Long-running continuous jobs degrade LLAP performance because of leaked shuffle manager threads


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: llap, Query Processor, Tez
    • Labels: None
    • Environment: Ubuntu 18.04

      Hadoop 3.1.1

      Tez: 0.9.1

      Hive: 3.1.0

      JDK: 1.8

       

    Description

      [Test Case/Reproduction]

      Run TPC-H Q19 on 10 GB of data in an infinite loop, with result caching disabled.

      [Observation]

      On the LLAP server, the number of threads increases continuously. The query keeps running, but performance degrades over time.

      [Analysis]

      I took multiple thread dumps at different intervals to determine which category of threads is causing the issue. The culprit is the Tez shuffle manager thread, which is parked in the following wait loop:

      .m2/org/apache/tez/tez-runtime-library/0.9.1/tez-runtime-library-0.9.1-sources.jar!/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java:324

      try {
        while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
            && numCompletedInputs.get() < numInputs) {
          inputContext.notifyProgress();
          boolean ret = wakeLoop.await(1000, TimeUnit.MILLISECONDS);
        }
      } finally {
        lock.unlock();
      }
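      For illustration, here is a minimal, self-contained sketch of a thread parked in a timed-wait loop of the same shape. It is not the Tez implementation and it does not claim to identify the root cause of this issue. The class `WaitLoopSketch`, its fields, and its `shutdown()` method are hypothetical names chosen for the example. The sketch only shows that such a loop keeps re-parking every second while the completed-input count stays below the expected count, and that, in this sketch, an explicit shutdown flag plus a signal is what lets the thread exit.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration only; not the Tez ShuffleManager. */
public class WaitLoopSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition wakeLoop = lock.newCondition();
    private final AtomicInteger numCompletedInputs = new AtomicInteger();
    private final AtomicBoolean isShutdown = new AtomicBoolean(false);
    private final int numInputs;

    public WaitLoopSketch(int numInputs) {
        this.numInputs = numInputs;
    }

    /** Parks in one-second slices until all inputs complete or shutdown is requested. */
    public void runLoop() throws InterruptedException {
        lock.lock();
        try {
            while (!isShutdown.get() && numCompletedInputs.get() < numInputs) {
                wakeLoop.await(1000, TimeUnit.MILLISECONDS);
            }
        } finally {
            lock.unlock();
        }
    }

    /** Sets the shutdown flag and wakes the waiting thread so it can leave the loop. */
    public void shutdown() {
        isShutdown.set(true);
        lock.lock();
        try {
            wakeLoop.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Returns {alive before shutdown, alive after shutdown} for a loop whose input never arrives. */
    public static boolean[] demo() {
        WaitLoopSketch sketch = new WaitLoopSketch(1); // one expected input that never completes
        Thread runner = new Thread(() -> {
            try {
                sketch.runLoop();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sketch-shuffle-runner");
        runner.setDaemon(true);
        runner.start();
        try {
            runner.join(1500);                 // the loop is still waiting after this timeout
            boolean aliveBefore = runner.isAlive();
            sketch.shutdown();                 // flag + signal lets the loop exit
            runner.join(5000);
            return new boolean[] { aliveBefore, runner.isAlive() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("alive before shutdown: " + result[0]);
        System.out.println("alive after shutdown:  " + result[1]);
    }
}
```

      If nothing ever sets the flag or completes the inputs, a thread in a loop like this stays alive indefinitely, which is the kind of behaviour worth checking for when a thread count grows without bound.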

       

      [Stack Trace of culprit thread]

      threadId:Thread 16661 - state:BLOCKED
      stackTrace:

      • sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
      • java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
      • java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, java.util.concurrent.TimeUnit) @bci=97, line=2163 (Compiled frame)
      • org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager$RunShuffleCallable.callInternal() @bci=125, line=327 (Compiled frame)
      • org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager$RunShuffleCallable.callInternal() @bci=1, line=311 (Compiled frame)
      • org.apache.tez.common.CallableWithNdc.call() @bci=8, line=36 (Compiled frame)
      • com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly() @bci=18, line=108 (Compiled frame)
      • com.google.common.util.concurrent.InterruptibleTask.run() @bci=16, line=41 (Compiled frame)
      • com.google.common.util.concurrent.TrustedListenableFutureTask.run() @bci=10, line=77 (Compiled frame)
      • java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
      • java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
      • java.lang.Thread.run() @bci=11, line=748 (Compiled frame)
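      The thread-dump comparison described in the analysis amounts to grouping threads by name and watching which group keeps growing. Below is a rough in-process sketch of that grouping. The class `ThreadCategoryCounter` and the normalization rule (replace every run of digits with `N`) are assumptions made for the example, not part of Hive, LLAP, or Tez. For an external daemon, the same grouping would have to be applied to the thread names in a captured dump instead.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: counts live threads in this JVM grouped by normalized name. */
public class ThreadCategoryCounter {

    /** Collapses digit runs so that "worker-3" and "worker-12" fall into one category. */
    public static String categoryOf(String threadName) {
        return threadName.replaceAll("\\d+", "N");
    }

    /** Returns category -> number of live threads, sorted by category name. */
    public static Map<String, Integer> countByCategory() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(categoryOf(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print one line per category; run repeatedly and compare the counts over time.
        countByCategory().forEach((category, count) ->
                System.out.println(count + "\t" + category));
    }
}
```

      Taking this snapshot at intervals and comparing the counts makes a steadily growing category stand out without reading full dumps by hand.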

       

      Attachments

        1. Screenshot from 2020-07-01 17-43-57.png (77 kB, manoj)
        2. t3.dump (464 kB, manoj)
        3. tdump.pdf (1.68 MB, manoj)


          People

            Assignee: Unassigned
            Reporter: manoj (manoj.red.hat)