Details
Improvement
Status: Patch Available
Major
Resolution: Unresolved
0.12.0
None
None
None
Description
The DistributedCache is shared across multiple jobs if the HDFS file name is the same.
We need to make sure Hive puts the same file into the same location every time and does not overwrite it if the file content is the same.
There are two possible outcomes:
A1. Files added with the same name, timestamp, and md5 in the same session will have a single copy in distributed cache.
A2. Files added with the same name, timestamp, and md5 (in any session) will have a single copy in distributed cache.
A2 offers more sharing, but it raises the question of when Hive should clean the files up in HDFS.
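
To illustrate the difference between A1 and A2, here is a minimal sketch of how a deterministic HDFS location could be derived from the file name, timestamp, and md5. It is not taken from the patch: the class name CachePathSketch, the method cachePath, the base directory, and the "shared"/"session-" layout are all hypothetical, and the sketch only builds the path string without touching HDFS.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Hive: derives a stable cache location
// so that identical files always map to the same path.
public class CachePathSketch {

    // Hex-encoded MD5 of the file content.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // sessionId != null: path is scoped to one session (option A1).
    // sessionId == null: path is shared across sessions (option A2).
    static String cachePath(String baseDir, String sessionId,
                            String name, long timestamp, byte[] content) {
        String scope = (sessionId == null) ? "shared" : "session-" + sessionId;
        return baseDir + "/" + scope + "/" + md5Hex(content)
                + "-" + timestamp + "/" + name;
    }
}
```

With a layout like this, Hive could check whether the target path already exists and skip the upload instead of overwriting. Under A1 the whole session directory can be removed when the session closes; under A2 the shared directory has no obvious owner, which is the cleanup question raised above.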