Details
Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Fixed
None
Description
BulkRecordWriter uses the value of the property mapreduce.output.bulkoutputformat.localdir if it is set, and falls back to the value of mapred.local.dir otherwise.
However, on a typical production system, mapred.local.dir is set to a comma-separated list of directories. Because the whole value is treated as a single path, BulkOutputFormat ends up writing to malformed paths such as
/dir1/,dir2,/dir3,KeySpaceName/CFName
This has two effects:
1) The directory is not removed when the job finishes, leading to disk space management issues.
2) If a new job is run against the same keyspace and CF, it tries to load the old data along with the new data.
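The path problem described above can be reproduced in isolation. The following is a minimal sketch, not the actual BulkRecordWriter code: the class name, both helper methods, and the sample directory list are hypothetical, and the concatenation only mirrors the behavior reported in this ticket.

```java
final class LocalDirPathDemo {
    // Hypothetical helper: treats the raw property value as one directory,
    // which is the behavior this ticket describes.
    static String naiveOutputPath(String localDirProperty, String keyspace, String columnFamily) {
        return localDirProperty + "/" + keyspace + "/" + columnFamily;
    }

    // Hypothetical helper: splits the value to show that it really names
    // several directories rather than one.
    static String[] splitLocalDirs(String localDirProperty) {
        return localDirProperty.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        // Assumed sample value; production values will differ.
        String localDirs = "/dir1,/dir2,/dir3";
        System.out.println("Naive path: " + naiveOutputPath(localDirs, "KeySpaceName", "CFName"));
        System.out.println("Directories actually configured: " + splitLocalDirs(localDirs).length);
    }
}
```

Since the naive path is not under any job-managed temporary directory, nothing cleans it up, which is consistent with both effects listed above.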
It would be better to use System.getProperty("java.io.tmpdir"), since that is set to an attempt-specific temporary directory which is cleaned up after the job finishes. See http://hadoop.apache.org/common/docs/current/mapred_tutorial.html, under "Directory Structure".
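A minimal sketch of the proposed fallback follows. It is an illustration under stated assumptions, not the committed patch: java.util.Properties stands in for the Hadoop job configuration so the example runs without Hadoop on the classpath, and the class and method names are hypothetical. Only the property name and the java.io.tmpdir lookup come from this ticket.

```java
import java.io.File;
import java.util.Properties;

final class BulkOutputDirResolver {
    // Property name taken from the ticket description.
    static final String OUTPUT_DIR_KEY = "mapreduce.output.bulkoutputformat.localdir";

    // Hypothetical helper: prefer the explicit setting, otherwise use the JVM
    // temporary directory instead of mapred.local.dir.
    static String resolveLocalDir(Properties jobConf) {
        String explicit = jobConf.getProperty(OUTPUT_DIR_KEY);
        if (explicit != null && !explicit.trim().isEmpty()) {
            return explicit.trim();
        }
        return System.getProperty("java.io.tmpdir");
    }

    // Hypothetical helper: builds <localdir>/<keyspace>/<columnFamily>.
    static File outputDir(Properties jobConf, String keyspace, String columnFamily) {
        return new File(new File(resolveLocalDir(jobConf), keyspace), columnFamily);
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // stand-in for the job configuration
        System.out.println(outputDir(conf, "KeySpaceName", "CFName").getPath());
    }
}
```

With no explicit setting, the output lands under whatever java.io.tmpdir points to for the running task, so cleanup is left to the framework rather than to the writer.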