Private non-Archive Files' size add twice in Distributed Cache directory size calculation. Private non-Archive Files list is passed in by "-files" command line option. The Distributed Cache directory size is used to check whether the total cache files size exceed the cache size limitation, the default cache size limitation is 10G.
I add log in addCacheInfoUpdate and setSize in TrackerDistributedCacheManager.java.
I use the following command to test:
hadoop jar ./wordcount.jar org.apache.hadoop.examples.WordCount -files hdfs://host:8022/tmp/zxu/WordCount.java,hdfs://host:8022/tmp/zxu/wordcount.jar /tmp/zxu/test_in/ /tmp/zxu/test_out
to add two files into distributed cache:WordCount.java and wordcount.jar.
WordCount.java file size is 2395 byes and wordcount.jar file size is 3865 bytes. The total should be 6260.
The log show these files size added twice:
add one time before download to local node and add second time after download to local node, so total file number becomes 4 instead of 2:
addCacheInfoUpdate size: 6260 num: 2 baseDir: /mapred/local
addCacheInfoUpdate size: 8683 num: 3 baseDir: /mapred/local
addCacheInfoUpdate size: 12588 num: 4 baseDir: /mapred/local
In the code, for Private non-Archive File, the first time we add file size is at
The second time we add file size is at
The fix is not to add the file size for for Private non-Archive File after download(downloadCacheObject).