Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.8.1
None
None
Description
Background:
The behavior of LOAD DATA LOCAL INPATH has changed. It used to return an error when the file being loaded already existed in the target table/partition. Now it renames the incoming file with a _copy_N suffix (for example, test_b_copy_1.bz2), so the file always goes into HDFS.
Original discussion:
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E
Issue:
There is no longer an atomic way to insert a file into Hive and guarantee that it won't go in twice. Using OVERWRITE is not an alternative, because it deletes the other logs already in the table/partition.
Example:
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
Result:
test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2
The test_b data was inserted 3 times, which is not the desired behavior in this instance.
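The naming pattern in the result above can be modeled locally with a small shell sketch. This is only an illustration of the observed behavior, not Hive's actual implementation: the function name next_free_name is hypothetical, it works against a local directory rather than HDFS, and it assumes the file name has an extension.

```shell
# Hypothetical local model of the observed renaming; not Hive's code.
# Prints the name a file would get in "dir" under the _copy_N scheme.
next_free_name() {
    dir="$1"
    name="$2"
    base="${name%.*}"    # part before the last dot, e.g. test_b
    ext="${name##*.}"    # part after the last dot, e.g. bz2
    candidate="$name"
    n=0
    # Keep trying test_b_copy_1.bz2, test_b_copy_2.bz2, ... until one is free.
    while [ -e "$dir/$candidate" ]; do
        n=$((n + 1))
        candidate="${base}_copy_${n}.${ext}"
    done
    printf '%s\n' "$candidate"
}
```

Because a free name is always found, a load of this kind never fails on a duplicate, which is exactly why the same log can go in more than once.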
Proposal:
Add an IF NOT EXISTS flag to LOAD DATA that restores the old fail-on-duplicate behavior. If the log file does not exist in the table/partition, the load would proceed normally. If the log file already exists in the table/partition, Hive would return an error and exit with a non-zero exit code.
Proposed HiveQL Example:
LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')
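Until such a flag exists, the intended semantics can be approximated on the client side. The sketch below is an assumption-laden illustration, not a replacement for the proposal: load_if_absent is a hypothetical helper, the partition directory must be supplied by the caller (the path in the usage line assumes the default warehouse location), and hive and hadoop are assumed to be on the PATH. The existence check and the load are two separate steps, so this is not atomic and two concurrent loaders can still both insert the file.

```shell
# Sketch only: check-then-load is NOT atomic; it just mimics the proposed
# error and exit-code behavior for the single-loader case.
load_if_absent() {
    file="$1"             # local file to load, e.g. test_b.bz2
    partition_dir="$2"    # HDFS directory of the target partition (assumed known)
    ds="$3"
    hr="$4"
    name=$(basename "$file")

    # "hadoop fs -test -e" exits 0 when the path exists.
    if hadoop fs -test -e "$partition_dir/$name"; then
        echo "error: $name already exists in $partition_dir" >&2
        return 1
    fi

    hive -e "LOAD DATA LOCAL INPATH '$file' INTO TABLE logs PARTITION(ds='$ds', hr='$hr')"
}
```

Example call, assuming the table lives under the default warehouse directory:

load_if_absent test_b.bz2 /user/hive/warehouse/logs/ds=2012-03-19/hr=23 2012-03-19 23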