First off, I'm really happy to see other people trying to improve the performance of the HDFSDirectory, so I'll offer some reasons why I landed on the current implementation in Blur.
Why Blur doesn't clone the HDFS file handle for clone in Lucene.
- Mainly because, since Lucene 4, cloned file handles don't always seem to get closed, and I didn't want all of those objects hanging around unclosed for long periods. Related: for those interested, Blur also has a Directory reference counter so that files deleted by Lucene stick around long enough for running queries to finish.
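To make the reference-counting idea concrete, here is a minimal sketch (not Blur's actual code; the class and method names are hypothetical) of deferring a delete until every in-flight reader has released the file:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a per-file reference counter: a delete requested
// while readers still hold the file is deferred until the last release.
public class RefCountingDeletes {
    private final Map<String, AtomicInteger> refs = new ConcurrentHashMap<>();
    private final Set<String> pendingDelete = ConcurrentHashMap.newKeySet();
    private final Set<String> deleted = ConcurrentHashMap.newKeySet();

    // A query opens the file: bump its reference count.
    public void open(String file) {
        refs.computeIfAbsent(file, f -> new AtomicInteger()).incrementAndGet();
    }

    // A query finishes with the file: drop the count, and if a delete was
    // deferred and this was the last reader, perform it now.
    public void release(String file) {
        AtomicInteger count = refs.get(file);
        if (count != null && count.decrementAndGet() == 0 && pendingDelete.remove(file)) {
            reallyDelete(file);
        }
    }

    // Lucene asks for a delete: do it immediately only if no reader holds it.
    public void delete(String file) {
        AtomicInteger count = refs.get(file);
        if (count == null || count.get() == 0) {
            reallyDelete(file);
        } else {
            pendingDelete.add(file);  // defer until the last release
        }
    }

    public boolean isDeleted(String file) {
        return deleted.contains(file);
    }

    private void reallyDelete(String file) {
        refs.remove(file);
        deleted.add(file);  // stand-in for removing from the underlying directory
    }
}
```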
Why Blur doesn't use the read[Fully](position,buf,off,len) method instead of the seek plus read[Fully](buf,off,len).
- When accessing the local file system, the positioned call would take a huge amount of time because of some internal setup Hadoop was doing on every call. This didn't seem to be an issue when using HDFS, but if you start using short-circuit reads it might become a problem. I haven't tested this in about six months, so it may have been improved in newer versions of Hadoop.
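For readers less familiar with the two access patterns: the difference is a stateful seek followed by a read versus a stateless positioned read. The sketch below illustrates both with plain local-file APIs (RandomAccessFile and FileChannel) rather than Hadoop's FSDataInputStream, which exposes the analogous pair:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PositionedVsSeekRead {

    // Pattern 1: mutate the handle's position, then read. The shared
    // position makes one handle unsafe for concurrent readers.
    static byte[] seekThenRead(RandomAccessFile file, long pos, int len) throws IOException {
        byte[] buf = new byte[len];
        file.seek(pos);
        file.readFully(buf, 0, len);
        return buf;
    }

    // Pattern 2: positioned read. The channel's own position is untouched,
    // so many threads can share one handle.
    static byte[] positionedRead(FileChannel channel, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos + buf.position());
            if (n < 0) throw new IOException("EOF before " + len + " bytes");
        }
        return buf.array();
    }

    // Small self-contained demo: read two slices of a temp file, one per pattern.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("demo", ".bin");
            try {
                Files.write(tmp, "hello world".getBytes(StandardCharsets.US_ASCII));
                try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                    String a = new String(seekThenRead(raf, 6, 5), StandardCharsets.US_ASCII);
                    String b = new String(positionedRead(raf.getChannel(), 0, 5), StandardCharsets.US_ASCII);
                    return a + "|" + b;
                }
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The positioned form is what makes sharing a single handle across clones attractive in the first place; the overhead described above is about what Hadoop did per positioned call, not about the pattern itself.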
Why Blur uses readFully versus read.
- Laziness? Not sure. I believe I assumed that a single seek + readFully against the filesystem would be better than making multiple calls, each with its own seek + read. Perhaps, though, it would be better not to use readFully, as you all are discussing, because of the sync call.
How would I really like to implement it?
- I would like to implement the file access system as a pool of file handles per file. Each file would have up to N open handles (configurable, defaulting to 10 or so), and all accesses from the base file object and its clones would check out a handle and release it when finished. That way the number of open handles is bounded, but some parallel access is still allowed.
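The pooled design described above could be sketched roughly as follows (a hypothetical generic pool, not Blur code; in practice H would be an FSDataInputStream opened for the file in question):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical sketch: up to `max` handles are opened lazily for one file;
// readers borrow a handle and return it, so concurrent reads proceed while
// the total number of open handles stays bounded.
public class HandlePool<H> {
    private final BlockingQueue<H> idle;
    private final Supplier<H> opener;
    private final int max;
    private int created = 0;

    public HandlePool(int max, Supplier<H> opener) {
        this.max = max;
        this.opener = opener;
        this.idle = new ArrayBlockingQueue<>(max);
    }

    // Take an idle handle if one exists, open a new one if we are under the
    // cap, otherwise block until some reader releases a handle.
    public H borrow() {
        H h = idle.poll();
        if (h != null) return h;
        synchronized (this) {
            if (created < max) {
                created++;
                return opener.get();
            }
        }
        try {
            return idle.take();  // all `max` handles are busy
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a handle", e);
        }
    }

    // Return a handle to the pool for the next reader.
    public void release(H h) {
        idle.offer(h);
    }
}
```

A real implementation would also need to close pooled handles when the file is deleted or the Directory is closed, which is where this interacts with the reference counting mentioned earlier.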
Hope this helps to explain why Blur has the implementation that it does.