|
545 is certainly more general and could handle all the cases. I looked at it briefly before doing this version and was concerned by the pervasiveness of its changes for the simple use case of loading selected fields. It seemed it might break my current code (e.g., elimination of IndexReader.document(int n) and Fieldable as new type of object on Documents). I thought a version that had no effect other than adding the new functionality would be safer and might be useful to others
I just took a closer look at 545. It seems like a good implementation. FYI, I can't find the definitions of FieldSelector, SetBaseFieldSelector or LoadFirstFielSelector? I would suggest leaving IndexReader.document(int) – there does not a appear to be a reason to take it out. Although it is also not difficult to add the extra null parameter, backward compatibility is a virtue. Presumably it would be easy to write a field selector that takes the list of fields to load (loading no others) and therefore get the API I need (for the cost of one extra allocation). IndexReader.document(int n) is still in there. All prior APIs work and the introduction of Fieldable is a drop in replacement for Field and should not break any existing APIs.
You are right, however, about the missing files. I assumed SVN diff would get them. I will add them. I think we could easily combine the two issues to get the best of both worlds. Also, I added a skipChars mechanism which I think improves the case of uncompressed chars. You're right about IndexReader.document(int), although it appears you removed (package api) FieldsReader.doc(int). I've been reading the patch file (and now the new files) rather than applying the patch since it affects some of the same files as another patch of mine (LUCENE-362).
Re. skipChars() do you have a newer version that is not attached? The version in your patch file takes, I believe, precisely the same number of instructions as mine. It's hard to see how to do it better without changing the index representation (to store byte count instead of char count for strings, which would require an extra seek to write a string). 545 is a good improvement. I was initiually put off by the scope of it. We don't need this issue (558). Is there interest in committing 545? There is one potentially important benefit of this approach over
The once case where this is done is with ParallelReader. This patch only accesses the reader(s) that contain the field(s) to be loaded, while By extending the FieldSelector interface a little, the optimization could be maintained in the more general I am all for extending the FieldSelector. I was on the fence about adding things like the state of the booleans for stored, indexed, term vector, etc. to the interface. In the end, I opted for the simplest, thinking others would add in their ideas.
Chuck, perhaps we can synch up our two patches. If you are up to it, add what you think you want on FieldSelector on Issue 545, or we can merge the two and open a new issue, closing out these two. Since we have a little bit of time before the 2.0 release, I think we can get this right to everyone's satisfaction. As for skipChars, I missed your version ( I guess we both need to learn to read patches better!). I agree that we can't really achieve an optimum solution unless we change the file format, and I don't think that is really warranted yet, although I know the GData thread has proposed some format changes too. Maybe we could sweep them both in at the same time. Just a thought. Grant, I think that's a great idea. I'll look at adding the extension to support the reader optimization for ParallelReader. Essentially it will provide a means for FieldSelectors to declare the complete list of fields that will ever be loaded as an optional operation. If this is declared, then ParallelReader (for example) can access only relevant readers.
There is another extension I'll add unless there are objections. The idea is to extend the lazy loading to support streaming through readerValue() and a new streamValue() (the former for uncompressed String fields and the latter for compressed String and binary fields). This will support getting a reader or stream to obtain the field value rather than reading it all into a String or byte[]. This seems like a huge advantage in many applications (e.g., my current one). It would be an upward incompatibility to support readerValue() this way (since it would no longer be true that exactly one of stringValue(), binaryValue() and readerValue() is non-null). So it could be a different method, or limited to a new Fieldable subtype. The reader returned will be full function – e.g., it is easy to support arbitrary mark() and reset(). It looks like
Please go a ahead an close this issue. The ParallelReader extension was incorporated into
closing per reporter as the spirit of the issue has already been captured in
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
We could, of course, merge the two, and have this patch hide the details of creating the FieldSelector.
Just a thought...