Lucene natively no longer has support for lazy field loading, but there is a "backwards layer" just for Solr in modules/misc (LazyDocument.java)
Yeah .. LazyDocument is seriously evil...
The document does not use maps for lookup; if you have many fields, it's always a scan through the ArrayList of all fields in the document.
It's worse than that though: having many fields and scanning through them all to fetch a single field value (or an array of field values for a single name) is a cost that has to be paid in 4.x regardless of whether you are using LazyDocument or not. The root problem here seems to be having fields that contain many values. That problem is exacerbated by the fact that, unlike 3.x lazy loading, 4.x LazyDocument/LazyField doesn't do anything to "cache" the fields you've already asked for.
Below are my notes from investigating this and trying to get up to speed on the new world order of document loading w/o FieldSelector. I'll experiment with some fixes after I get some food...
LUCENE-2308 - r1162347
- IndexReader.doc(int,FieldSelector) deleted
- FieldSelector moved to misc
- new concept StoredFieldVisitor introduced
- void IndexReader.document(int docID, StoredFieldVisitor visitor)
- new impl DocumentStoredFieldVisitor extends StoredFieldVisitor
- new impl FieldSelectorVisitor extends StoredFieldVisitor
- appears all the old FieldSelector logic from IndexReader moved here?
- contains a private "LazyField extends Field" that caches field values once fetched
- SolrIndexSearcher modified to use FieldSelectorVisitor
LUCENE-2621 - r1199779
- eliminates FieldSelector & FieldSelectorVisitor
- leaves StoredFieldVisitor & DocumentStoredFieldVisitor intact
- introduced public LazyDocument containing "LazyField implements IndexableField"
- this version of LazyField does not cache any data once fetched
- changes SolrIndexSearcher's SetNonLazyFieldSelector to extend StoredFieldVisitor
- adds LazyField to the Document for any fields not immediately needed
The crux of the problem is that:
- LazyDocument is lazy about loading the doc, but once you ask for the value of any LazyField, the entire Document (with all underlying IndexableField values) is loaded.
- even though the entire document has been loaded once a single LazyField is used, the performance of iterating over LazyFields is TERRIBLE when there are lots of values for a single field
- requests for the value of individual LazyFields are not cached/stored anywhere, so the poor performance affects all subsequent re-uses of the same LazyDocuments
The state tracked in a LazyField is a reference back to the underlying LazyDocument, the field name, and the "num" offset of this IndexableField in the list of values for that field name. When you ask the LazyField for its value, it asks the underlying LazyDocument to fetch the entire Document (if it hasn't already), then asks that Document for all values of the associated field name as an array, and then looks up its "num" offset in that array.
So if you build up an (outer) Document containing N LazyField instances for a field named "foo" (as is done in Solr's SetNonLazyFieldSelector), and then try to iterate over the values with something like String[] values = outerDoc.getValues("foo"); under the covers LazyField will load every value of every field of that document into memory as an "innerDoc". That innerDoc will then be asked N times to generate a new IndexableField[] of every value of field "foo" (which BTW involves iterating over every IndexableField value of every field), and each time N-1 elements of that array will be ignored and thrown away.
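To put a rough number on that, here is a toy model of the access pattern. None of these classes are Lucene's: InnerDoc, UncachedLazyField and scansToReadAll are hypothetical stand-ins that only mimic the shape described above (flat field list, name + "num" offset, no caching) so the linear scans can be counted.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: hypothetical stand-ins, not the real LazyDocument/LazyField. */
final class LazyFieldCostModel {

    /** Stand-in for the fully loaded "innerDoc": a flat list of {name, value} pairs, no map by name. */
    static final class InnerDoc {
        final List<String[]> fields = new ArrayList<>();
        long fieldsScanned = 0; // counts every list entry touched by a lookup

        /** Linear scan over every field of the doc to collect the values for one name. */
        String[] getValues(String name) {
            List<String> out = new ArrayList<>();
            for (String[] f : fields) {
                fieldsScanned++;
                if (f[0].equals(name)) {
                    out.add(f[1]);
                }
            }
            return out.toArray(new String[0]);
        }
    }

    /** Stand-in for the uncached lazy field: remembers only doc, name and "num". */
    static final class UncachedLazyField {
        private final InnerDoc doc;
        private final String name;
        private final int num;

        UncachedLazyField(InnerDoc doc, String name, int num) {
            this.doc = doc;
            this.name = name;
            this.num = num;
        }

        /** Rebuilds the whole value array on every call and keeps a single element. */
        String stringValue() {
            return doc.getValues(name)[num];
        }
    }

    /** Reads all n values of "foo" once from a doc that also holds otherFields unrelated fields. */
    static long scansToReadAll(int n, int otherFields) {
        InnerDoc doc = new InnerDoc();
        for (int i = 0; i < otherFields; i++) {
            doc.fields.add(new String[] {"other" + i, "x"});
        }
        List<UncachedLazyField> lazy = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            doc.fields.add(new String[] {"foo", "v" + i});
            lazy.add(new UncachedLazyField(doc, "foo", i));
        }
        for (UncachedLazyField f : lazy) {
            f.stringValue();
        }
        return doc.fieldsScanned;
    }

    public static void main(String[] args) {
        // 1000 values of "foo" plus 50 other fields: 1000 * 1050 entries scanned
        System.out.println(scansToReadAll(1000, 50)); // prints 1050000
    }
}
```

In this model, reading N values costs N * (total fields in the doc) scans, so it is quadratic as soon as most of the doc is the multi-valued field.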
It's not clear to me why FieldSelectorVisitor was eliminated in
LUCENE-2621 (no discussion in the comments on point) but it's also not clear to me why LazyDocument+LazyField would ever be a good idea in any application that had more then a handful of fields (and if you don't have very many fields, why are you lazy loading?).
It's also not clear to me why the LazyDocument version of LazyField doesn't include the same caching logic as the version that was included in FieldSelectorVisitor (or the older lazy loading code in 3.6). W/o that, the usage pattern in Solr – in which Document objects are cached – results in the worst of all possible worlds: once a Document is cached with only a small subset of "real" fields, and the rest are "LazyField" instances, every subsequent request for that document that involves those LazyFields is slow, even if they ask for the same LazyField over and over.
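For illustration only, this is roughly what "cache the value once fetched" buys in the cached-Document case. Again, nothing here is Lucene or Solr code: InnerDoc, CachingLazyField and scansForTwoPasses are made-up names, the sketch assumes values are never null, and it ignores thread safety (which a real fix would have to think about, since cached Documents are shared).

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: hypothetical classes showing per-field memoization, not an actual patch. */
final class CachingLazyFieldSketch {

    /** Stand-in for the loaded "innerDoc": flat list of {name, value} pairs. */
    static final class InnerDoc {
        final List<String[]> fields = new ArrayList<>();
        long fieldsScanned = 0; // counts every list entry touched by a lookup

        String[] getValues(String name) {
            List<String> out = new ArrayList<>();
            for (String[] f : fields) {
                fieldsScanned++;
                if (f[0].equals(name)) {
                    out.add(f[1]);
                }
            }
            return out.toArray(new String[0]);
        }
    }

    /** Lazy field that remembers its value after the first fetch. */
    static final class CachingLazyField {
        private final InnerDoc doc;
        private final String name;
        private final int num;
        private String cached; // null until first read (assumes no null values)

        CachingLazyField(InnerDoc doc, String name, int num) {
            this.doc = doc;
            this.name = name;
            this.num = num;
        }

        String stringValue() {
            if (cached == null) {
                cached = doc.getValues(name)[num]; // pay for the scan only once
            }
            return cached;
        }
    }

    /** Reads all n values twice; returns {scans after first pass, scans after second pass}. */
    static long[] scansForTwoPasses(int n) {
        InnerDoc doc = new InnerDoc();
        List<CachingLazyField> lazy = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            doc.fields.add(new String[] {"foo", "v" + i});
            lazy.add(new CachingLazyField(doc, "foo", i));
        }
        for (CachingLazyField f : lazy) {
            f.stringValue();
        }
        long afterFirst = doc.fieldsScanned;
        for (CachingLazyField f : lazy) {
            f.stringValue();
        }
        return new long[] {afterFirst, doc.fieldsScanned};
    }

    public static void main(String[] args) {
        long[] scans = scansForTwoPasses(100);
        System.out.println(scans[0] + " " + scans[1]); // prints 10000 10000
    }
}
```

Note that per-field caching like this only makes re-use of a cached Document cheap; the first pass over N values is still N full scans in this model, so it doesn't address the many-values cost by itself.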