[MAPREDUCE-7331] Make temporary directory used by FileOutputCommitter configurable - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: mrv2
Labels:
None
Environment:

CDH 6.2.1 Hadoop 3.0.0

Description

Spark SQL applications use FileOutputCommitter to commit and merge their files under a table directory. Because the pending directory name is hardcoded (PENDING_DIR_NAME = "_temporary"), multiple applications writing under the same table directory share one temporary directory. This causes one application to interfere with another application's temporary files, and one application can end up deleting the temporary files of another. There is currently no way for an application to use its own unique path for temporary files and so avoid interference from other, completely independent applications. I think the temporary directory used by FileOutputCommitter should be made configurable, so that the caller can supply its own unique value as required and keep its files from being deleted or overwritten by other applications.

Something like:

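// Default directory name, same value as today's hardcoded PENDING_DIR_NAME.
public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
// Proposed configuration key; the constant name here is only a suggestion.
public static final String PENDING_DIR_NAME_KEY =
"mapreduce.fileoutputcommitter.tempdir";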

Spark applications could use this to handle stage failures, where temporary directories left behind by previous attempts cause problems, and it would help in many other situations as well.
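To make the proposal concrete, here is a minimal, self-contained sketch of how the lookup might behave. It is not the FileOutputCommitter implementation: a plain Map stands in for the Hadoop Configuration object, the key "mapreduce.fileoutputcommitter.tempdir" is the one proposed in this issue and is not an existing Hadoop property, and the class, method, and constant names (PendingDirSketch, pendingDirName, pendingJobPath, PENDING_DIR_NAME_KEY) as well as the example paths are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class PendingDirSketch {
    // Default directory name, matching the value that is hardcoded today.
    public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
    // Key proposed in this issue; an assumption, not an existing Hadoop property.
    public static final String PENDING_DIR_NAME_KEY =
        "mapreduce.fileoutputcommitter.tempdir";

    // Returns the configured directory name, or the default when unset or blank.
    public static String pendingDirName(Map<String, String> conf) {
        String name = conf.get(PENDING_DIR_NAME_KEY);
        if (name == null || name.trim().isEmpty()) {
            return PENDING_DIR_NAME_DEFAULT;
        }
        return name.trim();
    }

    // Builds the pending path directly under the job's output directory.
    public static String pendingJobPath(String outputDir, Map<String, String> conf) {
        return outputDir + "/" + pendingDirName(conf);
    }

    public static void main(String[] args) {
        String table = "/warehouse/db/table"; // hypothetical table directory

        // Application A sets nothing and falls back to the default name.
        Map<String, String> appA = new HashMap<>();

        // Application B supplies its own unique name (hypothetical value).
        Map<String, String> appB = new HashMap<>();
        appB.put(PENDING_DIR_NAME_KEY, "_temporary_appB");

        System.out.println(pendingJobPath(table, appA));
        System.out.println(pendingJobPath(table, appB));
    }
}
```

With the default, every application writing to the same table resolves to the same pending path, which is the collision described above. Once each application sets a distinct value, the paths no longer overlap, so cleanup by one application would not touch the other's files.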

Attachments

Issue Links

is duplicated by

MAPREDUCE-7378 An error occurred while concurrently writing to a path

Resolved

MAPREDUCE-7366 FileOutputCommitter Enable Concurrent Writes

Resolved

is related to

MAPREDUCE-7341 Add a task-manifest output committer for Azure and GCS

Resolved

Activity

People

Assignee: Unassigned

Reporter: Bimalendu Choudhary

Votes: 0

Watchers: 4

Dates

Created: 23/Mar/21 15:54

Updated: 11/Oct/24 14:14