Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: tez-branch
    • Fix Version/s: tez-branch
    • Component/s: tez
    • Labels:
      None

      Description

      Currently, Pig reuses AMs via TezSession, but the sessions are not shut down when Pig exits. I noticed two problems with this:

      1. Tez jobs are not marked as finished until their TezSessions expire after the timeout. Because they keep occupying task slots, this blocks the submission of new jobs.
      2. ant clean test-tez leaves orphan processes (DAGAppMaster).

      Ideally, a TezSession should be kept alive while Pig runs but torn down when Pig exits.
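
      The desired lifecycle can be sketched with a plain JVM shutdown hook. This is only an illustration: `Session` and `SessionRegistry` are hypothetical names invented for the example, not the Tez client API or the classes changed by the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a reusable session; not the real Tez API. */
interface Session {
    void stop();
}

/** Hypothetical registry that keeps sessions alive until the JVM exits. */
final class SessionRegistry {
    private final List<Session> sessions = new ArrayList<>();

    /** Registers a JVM shutdown hook that stops every tracked session. */
    void installShutdownHook() {
        Runtime.getRuntime().addShutdownHook(
                new Thread(this::stopAll, "session-shutdown"));
    }

    /** Starts tracking a session so it can be reused and later stopped. */
    synchronized void track(Session session) {
        sessions.add(session);
    }

    /** Stops all tracked sessions and returns how many stopped cleanly. */
    synchronized int stopAll() {
        int stopped = 0;
        for (Session session : sessions) {
            try {
                session.stop();
                stopped++;
            } catch (RuntimeException e) {
                // One failing session should not prevent stopping the rest.
            }
        }
        sessions.clear();
        return stopped;
    }
}
```

      Because `stopAll()` is an ordinary method, it can also be called explicitly (for example from a test teardown) rather than only from the hook.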

      1. PIG-3602-3.patch
        5 kB
        Rohini Palaniswamy
      2. PIG-3602-2.patch
        5 kB
        Rohini Palaniswamy
      3. unit_test.txt
        36 kB
        Cheolsoo Park
      4. PIG-3602-1.patch
        3 kB
        Rohini Palaniswamy

        Activity

        Rohini Palaniswamy added a comment -

        I am not making this part of the HangingJobKiller shutdown hook, because we always want to kill sessions at shutdown, whereas PIG-3486 attempts to remove that shutdown hook if the job completed.

        Cheolsoo Park added a comment -

        Rohini Palaniswamy, not sure what's going on, but "ant test-tez" hangs after TestCombiner with the patch.

        I am attaching the thread dump that I took on my laptop. It contains the following stack trace, so the hang seems related to the shutdown hook:

            [junit] "main" prio=5 tid=7fb11f800800 nid=0x1031f3000 in Object.wait() [1031f2000]
            [junit]    java.lang.Thread.State: WAITING (on object monitor)
            [junit]     at java.lang.Object.wait(Native Method)
            [junit]     - waiting on <788320a30> (a org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1)
            [junit]     at java.lang.Thread.join(Thread.java:1225)
            [junit]     - locked <788320a30> (a org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1)
            [junit]     at java.lang.Thread.join(Thread.java:1278)
            [junit]     at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
            [junit]     at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
            [junit]     at java.lang.Shutdown.runHooks(Shutdown.java:79)
            [junit]     at java.lang.Shutdown.sequence(Shutdown.java:123)
            [junit]     at java.lang.Shutdown.exit(Shutdown.java:168)
            [junit]     - locked <7faf9fa58> (a java.lang.Class for java.lang.Shutdown)
            [junit]     at java.lang.Runtime.exit(Runtime.java:90)
            [junit]     at java.lang.System.exit(System.java:920)
            [junit]     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:912)
        
        Rohini Palaniswamy added a comment -

        I upgraded my Mac two days ago and the minicluster is not working on it, so I only ran the patch directly against the cluster and did not run ant test-tez, which I should have done. Sorry about that, and thanks for catching it.

        The actual problem is the thread below: if the request to the RM does not go through, it sleeps and retries. Because the default retry interval is in minutes (to support RM HA and rolling upgrades), this hangs for a long time. I will fix it to time out quickly if it is not able to stop.

        "Thread-511" prio=5 tid=7f8ddb12c000 nid=0x11666f000 waiting on condition [11666e000]
           java.lang.Thread.State: TIMED_WAITING (sleeping)
                at java.lang.Thread.sleep(Native Method)
                at org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
                at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:150)
                at com.sun.proxy.$Proxy79.getApplicationReport(Unknown Source)
                at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:195)
                at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:590)
                at org.apache.tez.client.TezSession.stop(TezSession.java:210)
                - locked <7883c9928> (a org.apache.tez.client.TezSession)
                at org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager.shutdown(TezSessionManager.java:179)
                - locked <7883c3818> (a java.util.ArrayList)
                at org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1.run(TezSessionManager.java:51)
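
        One way to keep a shutdown hook from waiting on such a retry loop is to run the stop call on a separate thread and give up after a deadline. The sketch below is an assumption about the general technique, not the change made in the attached patches; `BoundedStop` and `stopWithTimeout` are hypothetical names. The worker is a daemon thread because the trace shows the sleep happening in sleepAtLeastIgnoreInterrupts, so interrupting the stuck call may not end it.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a stop action but waits only a bounded time. */
final class BoundedStop {
    private BoundedStop() { }

    /** Returns true if stopAction finished normally within the timeout. */
    static boolean stopWithTimeout(Runnable stopAction, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "session-stop");
            t.setDaemon(true); // a stuck stop call must not keep the JVM alive
            return t;
        });
        Future<?> result = executor.submit(stopAction);
        try {
            result.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // best effort; the call may ignore interrupts
            return false;
        } catch (ExecutionException e) {
            return false; // the stop action itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

        A caller inside a shutdown hook would pass the real stop call as the `Runnable` and treat a `false` result as "could not stop in time, continue exiting".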
        
        Rohini Palaniswamy added a comment -

        This patch stops the sessions before minicluster is shut down.

        Cheolsoo Park added a comment -

        +1. Works like a charm. Thanks!

        One minor comment. We can break out of the loop after stopping the session in freeSession() and stopSession(), can't we?
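
        The suggestion can be illustrated as follows. This is a hypothetical sketch, not the code of freeSession() or stopSession(): `SessionPool`, `Entry`, and the `visited` counter are invented for the example, and it assumes each session appears at most once in the list, which is what makes the early exit safe.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical pool used only to illustrate breaking out after a match. */
final class SessionPool {
    /** Hypothetical pool entry identified by a unique id. */
    static final class Entry {
        final String id;
        boolean stopped;

        Entry(String id) {
            this.id = id;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    int visited; // loop iterations so far, kept only to show the early exit

    void add(String id) {
        entries.add(new Entry(id));
    }

    int size() {
        return entries.size();
    }

    /** Stops and removes the entry with the given id; returns true if found. */
    boolean stop(String id) {
        boolean found = false;
        Iterator<Entry> it = entries.iterator();
        while (it.hasNext()) {
            Entry entry = it.next();
            visited++;
            if (entry.id.equals(id)) {
                entry.stopped = true;
                it.remove();
                found = true;
                break; // ids are unique, so the remaining entries need no scan
            }
        }
        return found;
    }
}
```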

        Rohini Palaniswamy added a comment -

        Good catch. I added a break statement in both methods.

        Rohini Palaniswamy added a comment -

        Committed to the tez branch. Thanks, Cheolsoo, for the review and for catching the bug by trying it out.


          People

          • Assignee:
            Rohini Palaniswamy
            Reporter:
            Cheolsoo Park
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:
