Each HAR file system has two index files (_index and _masterindex) that contain information on how files are stored in the part files. During block location calculation, these indexes are reread for every file in the archive. Caching the indexes and the status of the part files will greatly reduce the number of NameNode operations during job setup.
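
The caching idea can be illustrated with a minimal, self-contained sketch. It is not the actual HarFileSystem code: the class name `HarIndexCache`, the `loader` function (standing in for whatever reads and parses the index files and part-file status), and the `maxEntries` bound are all hypothetical names introduced for illustration. The size bound is an assumption added here so that the cache cannot grow without limit.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch: caches parsed archive metadata keyed by archive path,
 * so the index files are read once per archive instead of once per file.
 */
class HarIndexCache<V> {
    private final Map<String, V> cache;
    private final Function<String, V> loader; // stand-in for reading the index files
    private int loads = 0;                    // counts how often the loader ran

    HarIndexCache(final int maxEntries, Function<String, V> loader) {
        this.loader = loader;
        // An access-ordered LinkedHashMap evicts the least recently used entry
        // once the (assumed) size bound is exceeded.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns cached metadata, invoking the loader only on a cache miss. */
    synchronized V get(String archivePath) {
        V meta = cache.get(archivePath);
        if (meta == null) {
            meta = loader.apply(archivePath);
            loads++;
            cache.put(archivePath, meta);
        }
        return meta;
    }

    synchronized int loadCount() {
        return loads;
    }
}
```

With such a cache, looking up block locations for many files in the same archive would trigger the expensive index read only on the first lookup; later lookups are served from memory.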
- is duplicated by
MAPREDUCE-865 harchive: Reduce the number of open calls to _index and _masterindex
- is related to
HADOOP-9757 Har metadata cache can grow without limit