
Key: LUCENE-545
Type: New Feature
Status: Resolved
Resolution: Fixed
Priority: Minor
Assignee: Grant Ingersoll
Reporter: Grant Ingersoll
Votes: 0
Watchers: 1
Lucene - Java

Field Selection and Lazy Field Loading

Created: 15/Apr/06 09:49 PM   Updated: 11/Jun/06 04:01 AM
Component/s: Store
Affects Version/s: 2.0.0
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
fieldSelectorPatch.txt (Text File, 109 kB), 2006-04-16 03:57 AM, Grant Ingersoll, licensed for inclusion in ASF works
LazyFields.tar.gz (GZip Archive, 30 kB), 2006-05-03 04:30 PM, Chuck Williams, licensed for inclusion in ASF works
newFiles.tar.gz (GZip Archive, 5 kB), 2006-04-28 02:29 AM, Grant Ingersoll, licensed for inclusion in ASF works
Issue Links:
Reference
 

Resolution Date: 10/Jun/06 08:40 AM


Description
The patch to come shortly implements a Field Selection and Lazy Loading mechanism for Document loading on the IndexReader.

It introduces a FieldSelector interface that defines the accept method:
FieldSelectorResult accept(String fieldName);

(Perhaps we want to expand this to take in other parameters such as the field metadata (term vector, etc.))

Anyone can implement a FieldSelector to define how they want to load fields for a Document.
The FieldSelectorResult can be one of four values: LOAD, LAZY_LOAD, NO_LOAD, LOAD_AND_BREAK.
As the FieldsReader loops over the FieldInfos, it applies the FieldSelector to determine what should be done with the current field.

I modeled this after the java.io.FileFilter mechanism. There are two implementations to date: SetBasedFieldSelector and LoadFirstFieldSelector. The former takes in two sets of field names, one to load immediately and one to load lazily. The latter returns LOAD_AND_BREAK on the first field encountered. See TestFieldsReader for examples.
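
As a rough illustration of the selector contract described above, here is a minimal Java sketch. The enum and interface are simplified stand-ins and not the classes from the patch, the *Sketch class names are hypothetical, and the precedence applied when a name appears in both sets is a choice made for this example only.

```java
import java.util.Set;

// Simplified stand-ins for the types described above; the patch's own
// classes may be declared differently.
enum FieldSelectorResult { LOAD, LAZY_LOAD, NO_LOAD, LOAD_AND_BREAK }

interface FieldSelector {
    FieldSelectorResult accept(String fieldName);
}

// Hypothetical set-based selector: some fields load eagerly, some lazily,
// and every other field is skipped.
class SetBasedSelectorSketch implements FieldSelector {
    private final Set<String> fieldsToLoad;
    private final Set<String> lazyFieldsToLoad;

    SetBasedSelectorSketch(Set<String> fieldsToLoad, Set<String> lazyFieldsToLoad) {
        this.fieldsToLoad = fieldsToLoad;
        this.lazyFieldsToLoad = lazyFieldsToLoad;
    }

    @Override
    public FieldSelectorResult accept(String fieldName) {
        // Assumption of this sketch: the eager set wins if a name is in both.
        if (fieldsToLoad.contains(fieldName)) {
            return FieldSelectorResult.LOAD;
        }
        if (lazyFieldsToLoad.contains(fieldName)) {
            return FieldSelectorResult.LAZY_LOAD;
        }
        return FieldSelectorResult.NO_LOAD;
    }
}

// Hypothetical load-first selector: tells the reader to load the first
// field it encounters and then stop.
class LoadFirstSelectorSketch implements FieldSelector {
    @Override
    public FieldSelectorResult accept(String fieldName) {
        return FieldSelectorResult.LOAD_AND_BREAK;
    }
}
```

A reader loop would call accept once per stored field and branch on the result, in the same way a directory listing consults a java.io.FileFilter for each file.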

It should support UTF-8 (I borrowed code from Issue 509, thanks!). See TestFieldsReader for examples.

I added an expert method on IndexInput named skipChars that takes in the number of characters to skip. This is a compromise that avoids changing the file format of the fields to better support seeking. It does some of the work of readChars, but not all of it. It doesn't require buffer storage and it doesn't do the bitwise operations. It just reads in the appropriate number of bytes and promptly ignores them. This is useful for skipping non-binary, non-compressed stored fields.
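
The idea behind skipChars can be sketched as follows. This is not the code from the patch: it works on a plain java.io.InputStream instead of IndexInput, the class name SkipCharsSketch is hypothetical, and it assumes each char is stored in one to three bytes (modified UTF-8). Under that assumption the lead byte still has to be inspected to learn how many bytes a char occupies, but no char is assembled and no buffer is filled.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-alone sketch; the real method lives on IndexInput.
class SkipCharsSketch {

    // Skips `length` encoded chars (not bytes), assuming 1 to 3 bytes per char.
    static void skipChars(InputStream in, int length) throws IOException {
        for (int i = 0; i < length; i++) {
            int lead = readByte(in);
            if ((lead & 0x80) == 0) {
                // 0xxxxxxx: single-byte char, nothing more to consume
                continue;
            } else if ((lead & 0xE0) == 0xC0) {
                // 110xxxxx: two-byte char, discard one continuation byte
                readByte(in);
            } else {
                // 1110xxxx: three-byte char, discard two continuation bytes
                readByte(in);
                readByte(in);
            }
        }
    }

    // Reads one byte and fails loudly if the stream ends early.
    private static int readByte(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) {
            throw new EOFException("stream ended inside a char");
        }
        return b;
    }
}
```

Because the stored length counts chars and not bytes, a loop of this kind is needed to step past a string field; a byte-count prefix would allow a single seek but would require the file format change mentioned above.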

The biggest change is by far the introduction of the Fieldable interface (as per Doug's suggestion from a mailing list email on Lazy Field loading from a while ago). Field is now a Fieldable. All uses of Field have been changed to use Fieldable. FieldsReader.LazyField also implements Fieldable.

Lazy Field loading is now implemented. It has a major caveat (which is documented): it clones the underlying IndexInput upon lazy access to the Field value. IT IS UNDEFINED whether a Lazy Field can be loaded after the IndexInput parent has been closed (although, from what I saw, it does work). I thought about adding a reattach method, but it seems just as easy to reload the document. See the TestFieldsReader and DocHelper for examples.
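
The lazy access pattern can be illustrated with a small stand-alone sketch. Everything here is hypothetical: FieldableSketch and LazyFieldSketch are illustrative names, a byte array stands in for the stored-fields file, and the `loads` counter exists only to make the deferred read visible. The patch itself clones the IndexInput at the point where this sketch reads from the array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the Fieldable idea: callers see only this
// interface and cannot tell an eager field from a lazy one.
interface FieldableSketch {
    String name();
    String stringValue();
}

// Remembers where the value lives and decodes it only on first access.
class LazyFieldSketch implements FieldableSketch {
    private final String name;
    private final byte[] store;   // stands in for the stored-fields file
    private final int offset;     // where this field's bytes start
    private final int length;     // how many bytes belong to this field
    private String value;         // stays null until the first access
    int loads = 0;                // number of real reads, for demonstration

    LazyFieldSketch(String name, byte[] store, int offset, int length) {
        this.name = name;
        this.store = store;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public String stringValue() {
        if (value == null) {
            // First access: read and decode now, then cache the result.
            value = new String(store, offset, length, StandardCharsets.UTF_8);
            loads++;
        }
        return value;
    }
}
```

The sketch also shows why the caveat matters: the field keeps a reference to its backing store, so its value can only be produced while that store is still readable.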

I updated a couple of other tests to reflect the new fields that are on the DocHelper document.

All tests pass.



Grant Ingersoll added a comment - 16/Apr/06 03:57 AM
All main tests pass and most of contrib (I think the failures are unrelated to my changes). All of contrib compiles as well.

As always, let me know if I can help with anything.


Grant Ingersoll made changes - 16/Apr/06 03:57 AM
Field Original Value New Value
Attachment fieldSelectorPatch.txt [ 12325383 ]
Grant Ingersoll added a comment - 28/Apr/06 02:29 AM
Forgot the new files.

Grant Ingersoll made changes - 28/Apr/06 02:29 AM
Attachment newFiles.tar.gz [ 12325964 ]
Chuck Williams added a comment - 03/May/06 04:30 PM
Continuing the discussion from LUCENE-558, LazyFields.tar.gz extends this patch (LUCENE-545) with an additional optimization: ParallelReader no longer reads fields from a reader whose fields are all NO_LOAD. No change to the FieldSelector interface was required to achieve this. A useful new FieldSelector, MapFieldSelector, is also provided, and TestParallelReader is extended to test both.

Bug fixes to ParallelReader from LUCENE-561 are also included.

Keeping everything involved factored and managing this with my other local changes has led to a slightly more complex file structure. The steps to use LazyFields.tar.gz are:

1. Unpack it
2. Apply fieldSelectorPatch.txt
3. Apply ParallelReader.patch
4. Apply TestParallelReader.patch
5. Unpack and copy fieldSelectorNewFiles.tar.gz
6. Copy LazyFields.new

The target of all patch applications and copies is the Lucene trunk.

When I applied fieldSelectorPatch.txt against the latest Lucene trunk, a couple of hunks failed to apply, but they were not relevant. The version included here is the original version, unchanged.


Chuck Williams made changes - 03/May/06 04:30 PM
Attachment LazyFields.tar.gz [ 12326193 ]
Grant Ingersoll made changes - 05/Jun/06 07:58 AM
Assignee Grant Ingersoll [ gsingers ]
Grant Ingersoll made changes - 05/Jun/06 07:59 AM
Status Open [ 1 ] In Progress [ 3 ]
Grant Ingersoll added a comment - 08/Jun/06 07:14 AM
Chuck,

I think the patch is missing your ListFieldSelector.java file. Can you please attach it?

Thanks,
Grant


Chuck Williams added a comment - 08/Jun/06 08:38 AM
Grant,

ListFieldSelector is not part of the update as MapFieldSelector subsumed its functionality. It appears there is a dangling import of ListFieldSelector in TestParallelReader. This line should just be deleted.

I'd update the patch, but a) fieldSelectorPatch.txt does not apply cleanly against the current svn head, and b) my local copy has other modifications and so I can't reliably extract just this patch.

Also, there is one minor fix/improvement to ParallelReader.document(). The iterator over the fields of a subreader should not be constructed if it won't be used, i.e., move the construction inside the first if:

    boolean include = (fieldSelector == null);
    if (!include) {
        Iterator it = ((Collection) readerToFields.get(reader)).iterator();
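
For context, the check that this fragment belongs to might look like the stand-alone sketch below. It is an assumption about the surrounding logic and not the code in the patch: ParallelReaderSketch, its nested Result and Selector types, and the shouldRead helper are hypothetical names. A subreader is read only if no selector was given or if the selector wants at least one of that subreader's fields.

```java
import java.util.Collection;
import java.util.Iterator;

// Hypothetical sketch of the per-subreader decision; not the real ParallelReader.
class ParallelReaderSketch {

    // Simplified stand-ins for FieldSelectorResult and FieldSelector.
    enum Result { LOAD, LAZY_LOAD, NO_LOAD, LOAD_AND_BREAK }

    interface Selector {
        Result accept(String fieldName);
    }

    // Returns true if the subreader owning `readerFields` needs to be read.
    static boolean shouldRead(Collection<String> readerFields, Selector fieldSelector) {
        // No selector means every field is wanted, so no iterator is needed.
        boolean include = (fieldSelector == null);
        if (!include) {
            // The iterator is created only on the path that uses it.
            Iterator<String> it = readerFields.iterator();
            while (it.hasNext()) {
                if (fieldSelector.accept(it.next()) != Result.NO_LOAD) {
                    include = true;
                    break; // one wanted field is enough
                }
            }
        }
        return include;
    }
}
```

With a selector that returns NO_LOAD for every field of a subreader, shouldRead returns false and that subreader's stored fields are never touched, which is the optimization described earlier in this thread.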

Sorry about these issues...

FYI, I've been using the capability extensively and it is working well. Thanks for lazy fields!

Chuck


Grant Ingersoll added a comment - 08/Jun/06 06:08 PM
OK, I will take care of it, thanks!

Grant Ingersoll added a comment - 10/Jun/06 08:40 AM
Thanks, Chuck, for the assistance!

Grant Ingersoll made changes - 10/Jun/06 08:40 AM
Resolution Fixed [ 1 ]
Status In Progress [ 3 ] Resolved [ 5 ]
Hoss Man made changes - 11/Jun/06 04:01 AM
Link This issue is related to LUCENE-558 [ LUCENE-558 ]