  1. Apache Tez
  2. TEZ-4348

ShuffleManager should try to report the original exception

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved

    Description

      The idea is the same as in TEZ-4336, but this ticket covers the unordered code paths.

      An example of the problem is attached as org.apache.hadoop.hive.cli.TestMiniLlapCliDriver-output.txt.
      This was discovered while I was working on a Hive ticket:
      1. A qtest failed.
      2. There was no obvious Hive-related error.
      3. The logs contained tons of messages like the one below:

      2024-07-26T00:21:36,900  INFO [Fetcher_B {Map_1 -> Reducer_2} #0] impl.ShuffleManager: Map_1 -> Reducer_2: Fetch failed for src: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1721978473743_0001_8_00_000000_0_10129, spillType=0, spillId=-1] InputIdentifier: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1721978473743_0001_8_00_000000_0_10129, spillType=0, spillId=-1], connectFailed: true, local fetch: false, remote fetch failure reported as local failure: false)
      

      4. After adding a log message to ShuffleManager, I found the following:

      2024-07-25T03:28:15,352  WARN [Fetcher_B {Map_1 -> Reducer_2} #0] impl.ShuffleManager: Fetch failure
      java.io.IOException: Failed to connect to http://lbodor-MBP16.local:0/mapOutput?job=job_1721903278713_0001&dag=8&reduce=0&map=attempt_1721903278713_0001_8_00_000000_0_10129, #connectionFailures=1
      	at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:166) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
      	at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:121) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
      	at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:505) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
      	at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:574) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
      	at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:493) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
      	at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:291) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
      	at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:78) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
      	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) ~[tez-common-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
      	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) ~[guava-28.2-jre.jar:?]
      	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) ~[guava-28.2-jre.jar:?]
      	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) ~[guava-28.2-jre.jar:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_292]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_292]
      	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
      Caused by: java.net.ConnectException: Can't assign requested address (connect failed)
      	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_292]
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_292]
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_292]
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_292]
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_292]
      	at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_292]
      	at sun.net.NetworkClient.doConnect(NetworkClient.java:175) ~[?:1.8.0_292]
      	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) ~[?:1.8.0_292]
      	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ~[?:1.8.0_292]
      	at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:1.8.0_292]
      	at sun.net.www.http.HttpClient.New(HttpClient.java:339) ~[?:1.8.0_292]
      	at sun.net.www.http.HttpClient.New(HttpClient.java:357) ~[?:1.8.0_292]
      	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226) ~[?:1.8.0_292]
      	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162) ~[?:1.8.0_292]
      	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056) ~[?:1.8.0_292]
      	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990) ~[?:1.8.0_292]
      	at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:149) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
      	... 13 more
      

      This eventually led to a DAG failure.

      The expected behavior is to:
      1. log the exception, and/or
      2. report the exception to the AM so that it can include it when reporting the DAG failure
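
      A minimal sketch of how the original exception could be turned into a one-line diagnostic that is both logged and passed along with the failure report. The class FetchFailureDiagnostics, its describe method, and the "; caused by: " separator are hypothetical names chosen for illustration; they are not part of Tez, and the actual reporting path to the AM is not shown here.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Hypothetical helper (not part of Tez): renders an exception and its causes as one line. */
public final class FetchFailureDiagnostics {

  // Cap the walk so a cyclic cause chain cannot loop forever.
  private static final int MAX_DEPTH = 10;

  private FetchFailureDiagnostics() {}

  /** Walks the cause chain and joins each link as "ClassName: message". */
  public static String describe(Throwable t) {
    StringBuilder sb = new StringBuilder();
    int depth = 0;
    for (Throwable cur = t; cur != null && depth < MAX_DEPTH; cur = cur.getCause(), depth++) {
      if (depth > 0) {
        sb.append("; caused by: ");
      }
      sb.append(cur.getClass().getName()).append(": ").append(cur.getMessage());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Mirrors the shape of the failure in the trace above; the URL is a placeholder.
    Throwable failure = new IOException(
        "Failed to connect to <url>, #connectionFailures=1",
        new ConnectException("Can't assign requested address (connect failed)"));
    System.out.println("Fetch failed: " + describe(failure));
  }
}
```

      With a string like this attached to the fetch-failure report, the root cause (here the ConnectException) would be visible in the DAG diagnostics without anyone having to add extra logging first.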

            People

              Assignee: László Bodor (abstractdog)
              Reporter: László Bodor (abstractdog)