[PIG-5290] User Cache upload contention can cause job failures - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.13.0
Fix Version/s: 0.18.0
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

We recently enabled the User Cache (PIG-2672) feature and found that jobs occasionally fail because of contention when uploading JARs into the cache. The cache is designed to be fail-safe, i.e. to fall back to normal behavior if anything goes wrong by catching all IOExceptions. However, the code that closes the output stream is not wrapped in a try statement, so an exception thrown while closing that stream causes the entire job to fail. If multiple jobs attempt to upload the same JAR simultaneously, the contention can cause this close call to fail.
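
To illustrate the close-inside-try point, here is a minimal sketch, not the actual Pig code. It uses `java.nio.file` on a local filesystem instead of Hadoop's `FileSystem` API so that it runs standalone; the class `FailSafeUpload` and the method `tryCache` are hypothetical names. Because try-with-resources closes the stream inside the try, an IOException from `close()` reaches the same catch block as any other failure.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; names are assumptions, not Pig's implementation.
class FailSafeUpload {
    /**
     * Copies src into the cache location.
     * Returns the cached path on success, or null to tell the caller
     * to fall back to the normal (non-cached) behavior.
     */
    static Path tryCache(Path src, Path cached) {
        // The stream is opened, written, and closed inside the try, so an
        // IOException raised by close() is caught below as well.
        try (OutputStream out = Files.newOutputStream(cached)) {
            Files.copy(src, out);
        } catch (IOException e) {
            return null; // fail-safe: caller falls back instead of failing the job
        }
        return cached;
    }
}
```

A caller would treat a `null` result as "cache unavailable" and ship the JAR the usual way.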

The current strategy also has two other flaws. First, consider the scenario where job A begins uploading jar X. Job B also needs jar X, sees that the file exists, and launches its tasks. However, job A has not finished uploading jar X (perhaps it is large), so job B's tasks localize a half-written version of jar X. Second, the original design allowed the same JAR (identical contents) to be shared between jobs even if a different name was used. PIG-3815 removed this ability, and now JARs are only shared if they have the same name.

I propose we solve all of these issues simultaneously by returning to the listStatus-based behavior (used prior to PIG-3815), but filtering out entries ending in .tmp. When uploading, write to randomNumber.tmp, then, once the file is complete, rename it to the original name of the JAR file. This ensures that incomplete files are never in a location that other jobs would access, and the only write operation touching a shared path is a single rename.
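
The proposal can be sketched roughly as follows. This is a local-filesystem analogue using `java.nio.file`, not the attached patch: `Files.list` stands in for `listStatus`, `Files.move` stands in for the HDFS rename, and the names `TmpThenRename`, `findCached`, and `upload` are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Stream;

// Hypothetical illustration; names are assumptions, not Pig's implementation.
class TmpThenRename {
    static final String TMP_SUFFIX = ".tmp";

    /** Lists the checksum directory and returns a completed JAR, skipping in-flight *.tmp files. */
    static Optional<Path> findCached(Path checksumDir) throws IOException {
        if (!Files.isDirectory(checksumDir)) {
            return Optional.empty();
        }
        try (Stream<Path> entries = Files.list(checksumDir)) {
            return entries
                    .filter(p -> !p.getFileName().toString().endsWith(TMP_SUFFIX))
                    .findFirst();
        }
    }

    /** Uploads to randomNumber.tmp, then renames to the JAR's original name. */
    static Path upload(Path localJar, Path checksumDir) throws IOException {
        Files.createDirectories(checksumDir);
        long random = ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
        Path tmp = checksumDir.resolve(random + TMP_SUFFIX);
        Path dest = checksumDir.resolve(localJar.getFileName());
        try {
            Files.copy(localJar, tmp);
            // The rename is the only write that touches the shared path.
            Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // Removes the temp file if the copy or rename failed; no-op after a successful rename.
            Files.deleteIfExists(tmp);
        }
        return dest;
    }
}
```

Readers calling `findCached` never see a partially written file, because an upload only becomes visible under a non-.tmp name once the rename has happened.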

An alternative design is to use a single canonical name for all JAR files (they will still be unique, since they live inside directories named after their SHA1). Upload to a tmp file as described above, then rename to the canonical name. This removes the need for a listStatus call; however, it results in classpaths that are not human-readable, since the name of the JAR file is lost. I think it's worth going with the first design for ease of debugging.
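
For comparison, the alternative layout might look like the sketch below. The canonical file name `cached.jar` and the class `CanonicalCachePath` are hypothetical; the source only says that the directories are based on the SHA1. Since the final path is fully determined by the contents, no directory listing is needed, but the original JAR name no longer appears on the classpath.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; names are assumptions, not Pig's implementation.
class CanonicalCachePath {
    // Assumed canonical name shared by every cached JAR.
    static final String CANONICAL_NAME = "cached.jar";

    /** Hex-encoded SHA-1 digest of the JAR contents. */
    static String sha1Hex(byte[] contents) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(contents)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** cacheRoot/<sha1>/cached.jar: unique per content, independent of the original name. */
    static Path cachedPath(Path cacheRoot, byte[] jarContents) throws NoSuchAlgorithmException {
        return cacheRoot.resolve(sha1Hex(jarContents)).resolve(CANONICAL_NAME);
    }
}
```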

Attachments

PIG-5290.patch, 16/Aug/17 23:45, 5 kB, Erik Krogen
PIG-5290-1.patch, 07/Sep/17 20:29, 5 kB, Erik Krogen

Issue Links

is related to:

PIG-3815 Hadoop bug causes to pig to fail silently with jar cache (Closed)
PIG-2672 Optimize the use of DistributedCache (Closed)

People

Assignee: Erik Krogen
Reporter: Erik Krogen
Votes: 0
Watchers: 4

Dates

Created: 11/Aug/17 15:55
Updated: 12/Sep/17 22:23
Resolved: 08/Sep/17 23:06