Issue Details (XML | Word | Printable)

Key: LUCENE-558
Type: New Feature New Feature
Status: Closed Closed
Resolution: Duplicate
Priority: Major Major
Assignee: Unassigned
Reporter: Chuck Williams
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Selective field loading

Created: 27/Apr/06 05:29 PM   Updated: 11/Jun/06 04:03 AM
Return to search
Component/s: Index
Affects Version/s: 2.0.0
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LuceneTrunk.patch 2006-04-27 05:30 PM Chuck Williams 18 kB
Environment: All
Issue Links:
Reference
 

Resolution Date: 11/Jun/06 04:03 AM


 Description  « Hide
Provides a new api, IndexReader.document(int doc, String[] fields). A document containing only the specified fields is created. The other fields of the document are not loaded, although unfortunately uncompressed strings still have to be scanned because the length information in the index is for UTF-8 encoded chars and not bytes. This is useful for applications that need quick access to a small subset of the fields. It can be used in conjunction with or for some uses instead of ParallelReader.

This is a much smaller change for a simpler use case than Lucene-545. No existing API's are affected.

All the tests pass and new tests are added to verify the feature.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Grant Ingersoll added a comment - 27/Apr/06 07:44 PM
This is pretty much what I started out with when I first started working on Lazy/Selective Field loading and I think it is a little bit simpler than 545 but is easily handled by the code in 545. Discussions on Lazy/Selective field loading indicated that people wanted more control over the process.

We could, of course, merge the two, and have this patch hide the details of creating the FieldSelector.

Just a thought...


Chuck Williams added a comment - 28/Apr/06 02:10 AM
545 is certainly more general and could handle all the cases. I looked at it briefly before doing this version and was concerned by the pervasiveness of its changes for the simple use case of loading selected fields. It seemed it might break my current code (e.g., elimination of IndexReader.document(int n) and Fieldable as new type of object on Documents). I thought a version that had no effect other than adding the new functionality would be safer and might be useful to others

I just took a closer look at 545. It seems like a good implementation. FYI, I can't find the definitions of FieldSelector, SetBaseFieldSelector or LoadFirstFielSelector? I would suggest leaving IndexReader.document(int) – there does not a appear to be a reason to take it out. Although it is also not difficult to add the extra null parameter, backward compatibility is a virtue. Presumably it would be easy to write a field selector that takes the list of fields to load (loading no others) and therefore get the API I need (for the cost of one extra allocation).


Grant Ingersoll added a comment - 28/Apr/06 02:22 AM
IndexReader.document(int n) is still in there. All prior APIs work and the introduction of Fieldable is a drop in replacement for Field and should not break any existing APIs.

You are right, however, about the missing files. I assumed SVN diff would get them. I will add them.

I think we could easily combine the two issues to get the best of both worlds.

Also, I added a skipChars mechanism which I think improves the case of uncompressed chars.


Chuck Williams added a comment - 28/Apr/06 05:01 AM
You're right about IndexReader.document(int), although it appears you removed (package api) FieldsReader.doc(int). I've been reading the patch file (and now the new files) rather than applying the patch since it affects some of the same files as another patch of mine (LUCENE-362).

Re. skipChars() do you have a newer version that is not attached? The version in your patch file takes, I believe, precisely the same number of instructions as mine. It's hard to see how to do it better without changing the index representation (to store byte count instead of char count for strings, which would require an extra seek to write a string).

545 is a good improvement. I was initiually put off by the scope of it. We don't need this issue (558). Is there interest in committing 545?


Chuck Williams added a comment - 28/Apr/06 07:34 AM
There is one potentially important benefit of this approach over LUCENE-545. By having the narrower more concrete API (list of fields to load vs. general selector for a variety of loading modes), it is possible to introduce optimizations into readers.

The once case where this is done is with ParallelReader. This patch only accesses the reader(s) that contain the field(s) to be loaded, while LUCENE-545 reads the fields from all readers. That can be a significant performance difference, although if using ParallelReaders the application should be able to arrange for this optimization itself.

By extending the FieldSelector interface a little, the optimization could be maintained in the more general LUCENE-545.


Grant Ingersoll added a comment - 28/Apr/06 06:00 PM
I am all for extending the FieldSelector. I was on the fence about adding things like the state of the booleans for stored, indexed, term vector, etc. to the interface. In the end, I opted for the simplest, thinking others would add in their ideas.

Chuck, perhaps we can synch up our two patches. If you are up to it, add what you think you want on FieldSelector on Issue 545, or we can merge the two and open a new issue, closing out these two. Since we have a little bit of time before the 2.0 release, I think we can get this right to everyone's satisfaction.

As for skipChars, I missed your version ( I guess we both need to learn to read patches better!). I agree that we can't really achieve an optimum solution unless we change the file format, and I don't think that is really warranted yet, although I know the GData thread has proposed some format changes too. Maybe we could sweep them both in at the same time. Just a thought.


Chuck Williams added a comment - 29/Apr/06 07:25 AM
Grant, I think that's a great idea. I'll look at adding the extension to support the reader optimization for ParallelReader. Essentially it will provide a means for FieldSelectors to declare the complete list of fields that will ever be loaded as an optional operation. If this is declared, then ParallelReader (for example) can access only relevant readers.

There is another extension I'll add unless there are objections. The idea is to extend the lazy loading to support streaming through readerValue() and a new streamValue() (the former for uncompressed String fields and the latter for compressed String and binary fields). This will support getting a reader or stream to obtain the field value rather than reading it all into a String or byte[]. This seems like a huge advantage in many applications (e.g., my current one).

It would be an upward incompatibility to support readerValue() this way (since it would no longer be true that exactly one of stringValue(), binaryValue() and readerValue() is non-null). So it could be a different method, or limited to a new Fieldable subtype.

The reader returned will be full function – e.g., it is easy to support arbitrary mark() and reset().


Otis Gospodnetic added a comment - 10/Jun/06 11:52 AM
It looks like LUCENE-545 solved the lazy field load issue, but you want to extend it some more, so I'll leave this open. If this is not true, we can close this.

Chuck Williams added a comment - 11/Jun/06 12:58 AM
Please go a ahead an close this issue. The ParallelReader extension was incorporated into LUCENE-545. The reader-valued lazy fields dropped off my radar screen and could always be opened as a new patch at a later date.

Hoss Man added a comment - 11/Jun/06 04:03 AM
closing per reporter as the spirit of the issue has already been captured in LUCENE-545