[CASSANDRA-3840] Use java.io.tmpdir as default output location for BulkRecordWriter - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 1.1.0
Component/s: None
Labels:
- bulkloader

Description

BulkRecordWriter writes to the directory named by the property mapreduce.output.bulkoutputformat.localdir if it is set, and otherwise falls back to the value of mapred.local.dir.

However, on a typical production system, mapred.local.dir is set to a comma-separated list of directories. BulkOutputFormat then treats the whole list as a single directory name and writes to nonsensical paths such as:

/dir1/,dir2,/dir3,KeySpaceName/CFName

This has two effects:

1) The directory is not removed when the job finishes, leading to disk space management issues.

2) If a new job is run against the same keyspace and column family, it tries to load the old data in addition to the new data.
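
The lookup described above can be illustrated with a minimal sketch. This is not the actual BulkRecordWriter code: the class LocalDirLookupSketch, its method names, the use of java.util.Properties as a stand-in for the Hadoop job configuration, and the directory names are all assumptions made for illustration. Only the two property names come from this issue.

```java
import java.util.Properties;

// Hypothetical stand-in for the lookup described in this issue;
// java.util.Properties replaces the Hadoop job configuration.
class LocalDirLookupSketch {
    static final String OUTPUT_LOCATION = "mapreduce.output.bulkoutputformat.localdir";
    static final String MAPRED_LOCAL_DIR = "mapred.local.dir";

    // Prefer the bulk-output property; otherwise use mapred.local.dir verbatim.
    static String outputLocation(Properties conf) {
        return conf.getProperty(OUTPUT_LOCATION, conf.getProperty(MAPRED_LOCAL_DIR));
    }

    // Append keyspace and column family to whatever string the lookup returned.
    static String outputPath(Properties conf, String keyspace, String columnFamily) {
        return outputLocation(conf) + "/" + keyspace + "/" + columnFamily;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // A multi-directory setting, as is common on production clusters.
        conf.setProperty(MAPRED_LOCAL_DIR, "/dir1,/dir2,/dir3");
        // The list is never split, so the commas end up inside the path.
        System.out.println(outputPath(conf, "KeySpaceName", "CFName"));
    }
}
```

Because the comma-separated value is used as if it were one directory, the resulting path does not correspond to any of the configured local directories.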

It would be better to use System.getProperty("java.io.tmpdir"), since that is set to an attempt-specific temporary directory which is cleaned up after the job finishes. See http://hadoop.apache.org/common/docs/current/mapred_tutorial.html, under "Directory Structure".
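
A minimal sketch of the proposed default follows. It is an illustration under stated assumptions, not the attached patch: the class TmpDirFallbackSketch, the method outputLocation, the use of java.util.Properties in place of the Hadoop job configuration, and the /data/bulk path are hypothetical.

```java
import java.util.Properties;

// Hypothetical sketch of the proposed default; not the attached patch.
class TmpDirFallbackSketch {
    static final String OUTPUT_LOCATION = "mapreduce.output.bulkoutputformat.localdir";

    // Use the explicit property if set, otherwise the JVM temporary directory.
    static String outputLocation(Properties conf) {
        String dir = conf.getProperty(OUTPUT_LOCATION, System.getProperty("java.io.tmpdir"));
        if (dir == null) {
            // Assumption: fail loudly rather than write to an undefined location.
            throw new IllegalStateException("Output directory not defined; set " + OUTPUT_LOCATION);
        }
        return dir;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // No explicit setting: prints the JVM's java.io.tmpdir.
        System.out.println(outputLocation(conf));
        // Explicit setting: the configured directory takes precedence.
        conf.setProperty(OUTPUT_LOCATION, "/data/bulk");
        System.out.println(outputLocation(conf));
    }
}
```

With this default, a job that sets nothing writes under a single temporary directory instead of a path built from the whole mapred.local.dir list, while an explicit mapreduce.output.bulkoutputformat.localdir still takes precedence.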

Attachments

java.io.tmpdir.patch (0.9 kB), attached 02/Feb/12 15:34 by Erik Forsberg

People

Assignee: Erik Forsberg

Reporter: Erik Forsberg

Authors: Erik Forsberg

Reviewers: Brandon Williams

Votes: 0

Watchers: 0

Dates

Created: 02/Feb/12 15:32

Updated: 16/Apr/19 09:32

Resolved: 02/Feb/12 16:10