Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
Impala 3.4.0
-
ghx-label-11
Description
The file handle cache can contain multiple file handles per filename. Currently it uses a std::multimap, where the file handles for each filename are a contiguous set of entries. A lookup will find the beginning of that range and then iterate through it to find a free one.
A multimap is implemented as a red-black tree with O(log(N)) lookup, so we should be able to improve this by using a hashtable-based structure such as unordered_map/unordered_multimap with O(1) lookup.
Another optimization would be to add an intermediary structure for each filename and hold all the file handles for that file name in a linked list. Lookup would find this intermediary structure by looking up the filename, then it would iterate. In the current method, the key/value pair for each file handle must store a copy of the filename string as the key, even for duplicates. With the intermediary structure, it would store the filename once per unique filename.
It also looks like the LRU list would benefit from being a Boost intrusive list (https://www.boost.org/doc/libs/1_64_0/doc/html/intrusive.html). Every file handle is always in the LRU list, so a std::list has a higher memory overhead and requires more memory accesses. It also complicates the code, because the FileHandleEntry needs to store a LruListType::iterator to its location in the LRU list.
These optimizations are low priority, but they provide good ramp-up for some C++ concepts/APIs.