Affects Version/s: None
Fix Version/s: None
The TabletServer includes a bunch of code to manage reuse of RFile.Reader objects across scans. There are a few reasons for doing this that certainly made sense in the past, including:
- Reducing index reads. If we have already found a location in the RFile index, and we can easily determine that this is the location that we need for the next read, then we can skip doing a binary search in the RFile index. Index caching and hierarchical indexes may improve this aspect to where we don't need this optimization.
- Reducing re-reading of the same data block. Re-reading a block used to mean that HDFS would create a new socket rather than reusing the current socket. Seeking to the same spot HDFS now reuses connections, so this is not as big a deal.
- Reducing namenode operations. Re-opening an RFile requires knowledge of the RFile's size. Even if we have the block locations cached, the file size still includes an HDFS namenode operation. This is something we should probably be caching within the TabletServer, since our RFiles are immutable.
There are also a few problems with reusing RFiles, including:
- Increased TabletServer complexity. We need to manage the resource pool among many scans. Not reusing readers along with ACCUMULO-416 may make it possible to eliminate coordination of RFile.Readers at the TabletServer level altogether.
- Risk of information leakage. RFile.Readers hold state, and we don't want that state passing between scan sessions. Sharing RFile.Readers makes it more likely that a bug in this class exists.
- Risk of interference. RFile.Readers are not thread safe now, but even if they were we wouldn't want separate scan sessions (or maybe even multiple threads in the same scan session) affecting each other through the RFile.Reader. Sharing RFile.Readers also makes it more likely that a bug in this class exists.
The relevant question is what would be the effect on creating a new RFile.Reader anytime we want to read an RFile instead of grabbing one from the pool? ACCUMULO-416 is certainly a prerequisite for this ticket, and we know we need to cache the RFile length. Are there any other prerequisites?