[MAPREDUCE-7015] Possible race condition in JHS if the job is not loaded - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.1.0, 3.0.1, 2.10.0
Component/s: jobhistoryserver
Labels: None
Hadoop Flags: Reviewed

Description

There could be a race condition inside JHS. In our build environment, TestMRJobClient.testJobClient() failed with this exception:

java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
	at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
	at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
	at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)

Root cause:
1. MapReduce job completes
2. CLI calls cluster.getJob(jobid)
3. The job is finished and the client side gets redirected to JHS
4. The job data is missing from CachedHistoryStorage so JHS tries to find the job
5. First it scans the intermediate directory and finds the job
6. A call to moveToDone() is scheduled on a separate thread inside moveToDoneExecutor and starts running immediately
7. RPC invocation returns with the path pointing to /tmp/hadoop-yarn/staging/history/done_intermediate
8. The call to moveToDone() completes which moves the contents of done_intermediate to done
9. Hadoop CLI tries to download the config file from done_intermediate but it's no longer there

Usually step #6 is slow enough to complete after #7, but sometimes it's faster, causing this race condition.
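
The ordering described in steps 6 to 9 can be reproduced outside Hadoop with a small sketch. This is not JHS code: the class HistoryRaceSketch, the method clientFindsFile, the file name job_0001_conf.xml and the local temp directories are all hypothetical stand-ins. Latches force the two interleavings so that the outcome is deterministic.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the JHS lookup path; not Hadoop code.
class HistoryRaceSketch {

    // Simulates one lookup. Returns true if the client still finds the
    // config file at the path that the "server" handed back.
    static boolean clientFindsFile(boolean moveFinishesFirst) throws Exception {
        Path root = Files.createTempDirectory("jhs-sketch");
        Path intermediate = Files.createDirectory(root.resolve("done_intermediate"));
        Path done = Files.createDirectory(root.resolve("done"));
        Path conf = Files.createFile(intermediate.resolve("job_0001_conf.xml"));

        ExecutorService moveToDoneExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch clientHasRead = new CountDownLatch(1);
        CountDownLatch moveFinished = new CountDownLatch(1);
        try {
            // Step 6: the move is scheduled on a separate thread.
            moveToDoneExecutor.submit(() -> {
                try {
                    if (!moveFinishesFirst) {
                        clientHasRead.await(); // slow move: the client reads first
                    }
                    Files.move(conf, done.resolve(conf.getFileName()));
                } finally {
                    moveFinished.countDown();
                }
                return null;
            });

            // Step 7: the RPC answers with the intermediate path.
            Path returnedToClient = conf;

            if (moveFinishesFirst) {
                moveFinished.await(); // step 8 completes before step 9
            }

            // Step 9: the client looks for the file at the returned path.
            boolean found = Files.exists(returnedToClient);
            clientHasRead.countDown();
            moveFinished.await();
            return found;
        } finally {
            moveToDoneExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("slow move, file found: " + clientFindsFile(false));
        System.out.println("fast move, file found: " + clientFindsFile(true));
    }
}
```

With the slow move the client still sees the file. With the fast move the returned path is already stale, which corresponds to the FileNotFoundException above.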

Attachments

MAPREDUCE-7015-001.patch (17/Jan/18 11:34, 6 kB, Peter Bacsko)
MAPREDUCE-7015-POC01.patch (01/Dec/17 15:14, 6 kB, Peter Bacsko)
MAPREDUCE-7015-POC02.patch (05/Jan/18 13:32, 4 kB, Peter Bacsko)

Issue Links

is related to: MAPREDUCE-7131 Job History Server has race condition where it moves files from intermediate to finished but thinks file is in intermediate (Resolved)

relates to: MAPREDUCE-7020 Task timeout in uber mode can crash AM (Resolved)

People

Assignee: Peter Bacsko

Reporter: Peter Bacsko

Votes: 0

Watchers: 5

Dates

Created: 27/Nov/17 15:06

Updated: 28/Aug/18 17:44

Resolved: 24/Jan/18 20:55