Details
- Type: Improvement
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
Description
The idea is the same as in TEZ-4336, but this ticket covers the unordered codepaths.
An example of the problem is attached as org.apache.hadoop.hive.cli.TestMiniLlapCliDriver-output.txt.
This was discovered while I was working on a Hive ticket:
1. A qtest failed.
2. There was no obvious Hive-related error.
3. The logs contained tons of messages like the one below:
2024-07-26T00:21:36,900 INFO [Fetcher_B {Map_1 -> Reducer_2} #0] impl.ShuffleManager: Map_1 -> Reducer_2: Fetch failed for src: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1721978473743_0001_8_00_000000_0_10129, spillType=0, spillId=-1] InputIdentifier: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1721978473743_0001_8_00_000000_0_10129, spillType=0, spillId=-1], connectFailed: true, local fetch: false, remote fetch failure reported as local failure: false)
4. After adding a log message to ShuffleManager, I found the following:
2024-07-25T03:28:15,352 WARN [Fetcher_B {Map_1 -> Reducer_2} #0] impl.ShuffleManager: Fetch failure
java.io.IOException: Failed to connect to http://lbodor-MBP16.local:0/mapOutput?job=job_1721903278713_0001&dag=8&reduce=0&map=attempt_1721903278713_0001_8_00_000000_0_10129, #connectionFailures=1
    at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:166) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:121) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:505) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:574) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:493) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:291) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:78) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) ~[tez-common-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) ~[guava-28.2-jre.jar:?]
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) ~[guava-28.2-jre.jar:?]
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) ~[guava-28.2-jre.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_292]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_292]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: java.net.ConnectException: Can't assign requested address (connect failed)
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_292]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_292]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_292]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_292]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_292]
    at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_292]
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.New(HttpClient.java:339) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.New(HttpClient.java:357) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990) ~[?:1.8.0_292]
    at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:149) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    ... 13 more
This eventually led to a DAG failure.
The expected behavior is:
1. Log the exception, and/or
2. Report the exception to the AM so that it can include it in the diagnostics on DAG failure.
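The two expected behaviors could look roughly like the sketch below. This is not Tez code: FetchFailureTracker, onFetchFailure and buildDiagnostics are hypothetical names used only for illustration, and how the diagnostics string would actually reach the AM is left open. The sketch only shows the idea of logging the throwable itself and remembering the first (original) exception so it can be attached to the failure report.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.concurrent.atomic.AtomicReference;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper (not part of Tez): keeps the first fetch failure for later reporting. */
class FetchFailureTracker {
  private static final Logger LOG = Logger.getLogger(FetchFailureTracker.class.getName());

  // Only the first exception is kept; later failures are often repeats of the same root cause.
  private final AtomicReference<Throwable> firstFailure = new AtomicReference<>();

  /** Expected behavior 1: log the exception with its stack trace, not only "Fetch failed". */
  void onFetchFailure(String source, Throwable cause) {
    firstFailure.compareAndSet(null, cause);
    LOG.log(Level.WARNING, "Fetch failed for src: " + source, cause);
  }

  /** Expected behavior 2: build a diagnostics string that carries the original exception. */
  String buildDiagnostics(String message) {
    Throwable original = firstFailure.get();
    if (original == null) {
      return message;
    }
    StringWriter trace = new StringWriter();
    original.printStackTrace(new PrintWriter(trace, true));
    return message + ", original exception: " + trace;
  }
}
```

With something like this in place, the component that decides the shuffle is unhealthy could pass buildDiagnostics(...) as the failure message instead of a generic one, so that the ConnectException above would show up directly in the DAG failure output.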
Attachments
Issue Links
- relates to: TEZ-4336 ShuffleScheduler should try to report the original exception (when shuffle becomes unhealthy) (Resolved)