[HIVE-17941] Don't Re-Create RunningJob Client During Status Checks - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.3.1, 3.0.0
Fix Version/s: None
Component/s: HiveServer2
Labels:
None

Description

org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper

while (!rj.isComplete()) {
  ...
        RunningJob newRj = jc.getJob(rj.getID());
        if (newRj == null) {
          // under exceptional load, hadoop may not be able to look up status
          // of finished jobs (because it has purged them from memory). From
          // hive's perspective - it's equivalent to the job having failed.
          // So raise a meaningful exception
          throw new IOException("Could not find status of job:" + rj.getID());
        } else {
          th.setRunningJob(newRj);
          rj = newRj;
        }
      }
  ...
}

https://github.com/apache/hive/blob/a9f25c0e7ad3f81a9f00f601947a161516e33f1b/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/HadoopJobExecHelper.java#L295-L306

Every time we loop here for a status update, we are rebuilding the RunningJob object to test if the Job information is still loaded in YARN. Rebuilding this RunningJob object is not trivial because it requires that we re-load and parse the Job Configuration XML file every time.

Outdated Stacktrace But Same Idea Holds

at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:120)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1924)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1877)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1785)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:712)
        at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1951)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:398)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:388)
        at org.apache.hadoop.mapred.JobClient$NetworkedJob.<init>(JobClient.java:174)
        at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:655)
        at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:668)
        at org.apache.hadoop.hive.ql.exec.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:282)
        at org.apache.hadoop.hive.ql.exec.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:532)

Maybe we can be use isRetired() instead for this particular check. We also probably need to be better about checking the return value from any of the RunningJob methods if it's the case that they can fail/go-away at any time if YARN purges the information. It seems that perhaps this was an attempt to detect a purged job before exercising the RunningJob object... even though it can go bad at any point.

https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/mapred/RunningJob.html

Attachments

Issue Links

is related to

HIVE-4009 CLI Tests fail randomly due to MapReduce LocalJobRunner race condition

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: David Mollitor

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 30/Oct/17 21:04

Updated:: 30/Jul/21 21:45