Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.10.2
- None
- None
Description
TL;DR: after TEZ-4338, setDestinationLocalhostName hit an NPE because an InputReadErrorEvent was created with a null destinationLocalHostName. This is unlikely in production, because the 3-parameter InputReadErrorEvent.create(...) is not used there.
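To illustrate the failure mode, here is a minimal sketch, assuming setDestinationLocalhostName behaves like a protobuf-generated builder setter (which rejects null arguments with a NullPointerException). EventBuilderSketch, NullGuardSketch, buildUnguarded and buildGuarded are hypothetical names made up for this example; they are not Tez classes and this is not necessarily how the issue was fixed.

```java
import java.util.Objects;

// Hypothetical demo class, not part of Tez.
final class NullGuardSketch {

    // Passes the host name straight through; a null value fails inside the setter.
    static EventBuilderSketch buildUnguarded(String host) {
        return new EventBuilderSketch().setDestinationLocalhostName(host);
    }

    // Sets the field only when a value is available, leaving it unset otherwise.
    static EventBuilderSketch buildGuarded(String host) {
        EventBuilderSketch builder = new EventBuilderSketch();
        if (host != null) {
            builder.setDestinationLocalhostName(host);
        }
        return builder;
    }

    public static void main(String[] args) {
        try {
            buildUnguarded(null);
        } catch (NullPointerException e) {
            System.out.println("unguarded call threw NullPointerException");
        }
        EventBuilderSketch guarded = buildGuarded(null);
        System.out.println("guarded call, field set: " + guarded.hasDestinationLocalhostName());
    }
}

// Hypothetical stand-in for a protobuf-style generated builder (assumption, not Tez code).
final class EventBuilderSketch {
    private String destinationLocalhostName = "";
    private boolean hasDestination = false;

    // Mirrors the null check that generated protobuf string setters perform.
    EventBuilderSketch setDestinationLocalhostName(String value) {
        this.destinationLocalhostName = Objects.requireNonNull(value);
        this.hasDestination = true;
        return this;
    }

    boolean hasDestinationLocalhostName() {
        return hasDestination;
    }

    String getDestinationLocalhostName() {
        return destinationLocalhostName;
    }
}
```

The guarded variant leaves the optional field unset instead of passing null, so a caller that has no destination host name available does not fail while building the event.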
TestFaultTolerance has become flakier recently. This is worth investigating, because a unit test failure could also indicate a product bug in the handling of failure scenarios.
According to the jstack of the surefire process (surefire_jstack.log), the problem can be reproduced only by TestFaultTolerance.testBasicInputFailureWithoutExitDeadline:
"Thread-1355" #1569 prio=5 os_prio=31 tid=0x00007fe76660c800 nid=0x43d07 waiting on condition [0x000070002ab38000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:155)
	at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:142)
	at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:138)
	at org.apache.tez.test.TestFaultTolerance.testBasicInputFailureWithoutExitDeadline(TestFaultTolerance.java:351)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
This is the point where the test waits for the DAG to finish.
Attachments
Issue Links
- is caused by
- TEZ-4338 Tez should consider node information to realize OUTPUT_LOST as early as possible - upstream(mapper) problems
- Resolved
- links to