Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.7.1
-
None
-
None
Description
A client had a simple transient failure in polling the JT for job status (which it does for HIVECOUNTERSPULLINTERVAL for each currently running job).
java.io.IOException: Call to HOST/IP:PORT failed on local exception: java.io.IOException: Connection reset by peer at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142) at org.apache.hadoop.ipc.Client.call(Client.java:1110) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) at org.apache.hadoop.mapred.$Proxy10.getJobStatus(Unknown Source) at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:1053) at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:1065) at org.apache.hadoop.hive.ql.exec.ExecDriver.progress(ExecDriver.java:351) at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:686) at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:131) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286) at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:310) at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:317) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:490) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
This lead to Hive thinking the running job itself has failed, and it failed the query run, although the running job progressed to completion in the background.
We should not let transient IOExceptions in counter polling cause query termination, and should instead just retry.