  1. Apache Tez
  2. TEZ-2475

Tez local mode hanging in big testsuite

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0, 0.6.1
    • Fix Version/s: 0.5.4, 0.6.2, 0.8.0-alpha, 0.7.1
    • Component/s: None
    • Labels:
      None

      Description

      We have a big test suite for Lingual, our SQL layer for Cascading. We are trying very hard to make it work correctly on Tez, but I am stuck:

      The setup is a huge suite of SQL-based tests (6000+), which are executed in order in local mode. At some point the whole process just stops and nothing gets executed any longer. This does not happen on every run, but quite often. Note that it does not happen at the same line of code each time but at seemingly random points, which makes it quite hard to debug.

      What I am seeing are stack traces like this one in the middle of the run:

      2015-05-21 16:07:42,413 ERROR [TaskHeartbeatThread] task.TezTaskRunner (TezTaskRunner.java:reportError(333)) - TaskReporter reported error
      java.lang.InterruptedException
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
      at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:187)
      at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)

      This looks like it could be related to the hang, although the hang does not happen immediately afterwards but some time later.
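For readers unfamiliar with this kind of trace: an `InterruptedException` raised from `Condition.await` means the waiting thread was interrupted by some other thread, for example through `ExecutorService.shutdownNow()`. The following is a minimal sketch of that mechanism using only `java.util.concurrent`. It is not Tez code; the class name, method name, and the simplified heartbeat-style task are hypothetical, and it makes no claim about what actually interrupts the heartbeat thread in Tez.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class InterruptedAwaitSketch {

    // Hypothetical stand-in for a heartbeat-style task that waits on a Condition.
    static String runAndInterrupt() throws Exception {
        ReentrantLock lock = new ReentrantLock();
        Condition wakeUp = lock.newCondition();
        CountDownLatch started = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();

        Future<String> result = executor.submit(() -> {
            lock.lock();
            try {
                started.countDown();
                // Blocks until signalled, timed out, or interrupted.
                wakeUp.await(60, TimeUnit.SECONDS);
                return "timed out or signalled";
            } catch (InterruptedException e) {
                // This is the exception type seen in the trace above.
                return "interrupted";
            } finally {
                lock.unlock();
            }
        });

        // Wait until the worker is running, then interrupt it via the executor.
        started.await();
        executor.shutdownNow();
        return result.get(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAndInterrupt());
    }
}
```

The interrupt is delivered even if the worker has not yet reached `await`, because `await` checks the interrupt flag on entry. So an exception like the one above only shows that something interrupted the waiting thread; it does not by itself say which component did so or why.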

      I have gone through quite a few JIRAs and saw that there were problems with locks and hanging threads before. Those should be fixed by now, but the hang still happens.

      I have tried 0.6.1 and 0.7.0. Both show the same behaviour.

      This gist contains a thread dump of a hanging build: https://gist.github.com/fs111/1ee44469bf5cc31e5a52
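Because the hang shows up at random points, it may help to capture such a dump from inside the test JVM, for example from a watchdog that fires when a test exceeds a time limit. Below is a minimal sketch using the standard `java.lang.management` API. The class and method names are hypothetical, and how the dump is triggered is left to the test harness.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {

    // Builds a textual dump of all live threads plus a deadlock count.
    static String dumpThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }

        // Both find* methods return null when no deadlock is detected.
        long[] deadlocked = synchronizers
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        out.append("deadlocked threads: ")
           .append(deadlocked == null ? 0 : deadlocked.length).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpThreads());
    }
}
```

A deadlock count of zero does not rule out a hang: threads that wait forever on a condition that is never signalled are not reported as deadlocked, so the per-thread stacks are still worth reading.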

        Attachments

        1. 2015-05-21_15-55-20_buildLog.log.gz (1.88 MB, André Kelpe)
        2. TEZ-2475.1.branch6.txt (11 kB, Siddharth Seth)
        3. TEZ-2475.1.txt (12 kB, Siddharth Seth)
        4. TEZ-2475.2.branch6.txt (11 kB, Siddharth Seth)
        5. TEZ-2475.2.incr.branch7.txt (0.7 kB, Siddharth Seth)
        6. TEZ-2475.2.txt (11 kB, Siddharth Seth)
        7. TEZ-2475.debug.1.txt (2 kB, Siddharth Seth)

              People

              • Assignee: Siddharth Seth (sseth)
              • Reporter: André Kelpe (fs111)
              • Votes: 0
              • Watchers: 9
