Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: search
    • Labels:
      None

      Description

      Add lazy field loading to solr.

      Currently solr reads all stored fields and filters the undesired fields based on the field list. This is usually not a performance concern, but when using solr to store large numbers of fields, or just one large field (doc contents, eg. for highlighting), it is perceptible.

      Now, there is a concern with the doc cache of SolrIndexSearcher, which assumes it has the whole document in the cache. To maintain this invariant, it is still the case that all the fields in a document are loaded in a searcher.doc call. However, if a field set is given to teh method, only the given fields are loaded directly, while the rest are loaded lazily.

      Some concerns about lazy field loading
      1. Lazy field are only valid while the IndexReader is open. I believe this is fine since the IndexReader is kept alive by the SolrIndexSearcher, so all docs in the cache have the reader available.
      2. It is slower to read a field lazily and retrieve its value later than retrieve it directory to begin with (though I don't know how much--depends on i/o factors). We certainly don't want this to be the common case. I added an optional call which accumulates all the field likely to be used in the request (highlighting, reponse writing), and populates the IndexSearcher cache a priori. This has the added advantage of concentrating doc retrieval in a single place, which is nice from a performance testing perspective.
      3. LazyFields are incompatible with the sundry Field declarations scattered about Solr. I believe I've changed all the necessary locations to Fieldable.

      Comments appreciated

      1. lazyfields_patch.diff
        35 kB
        Mike Klaas
      2. lazyfields_patch.diff
        33 kB
        Mike Klaas
      3. lazyfields_patch.diff
        32 kB
        Mike Klaas
      4. lazyfields_patch.diff
        32 kB
        Mike Klaas

        Issue Links

          Activity

          Hide
          Yonik Seeley added a comment -

          +1, looks good.

          There are some small backward incompatabilities (any place that returns a Fieldable, like getUniqueKeyField), but it can't be helped, and it's fairly expert level anyway.

          My only concern was about a memory increase for lazy-loaded short fields. I reviewed some of the LazyField code just now, and it looks like this shouldn't be the case:

          • LazyField is an inner class that contains an extra 3 members. It's outer class that it will retain a reference to is FieldsReader. The fieldsReader instance is a member of SegmentReader, and has the same lifetime as the SegmentReader. Hence the LazyField won't extend the lifetime of any other objects.

          One thing I did see is the internal char[] buffer used to read the string in LazyField is a member for some reason (hence the data will be stored in the field twice for some reason). I think this is probably a bug, and I'll bring it up on the Lucene list.

          Ideas for future optimizations:

          • if there is no document cache, change lazy to no-load
          • special cases: if only a single field (like the ID field) is selected out of many documents to be return, consider bypassing doc cache and use LOAD_AND_BREAK if we know there is only a single value.
          Show
          Yonik Seeley added a comment - +1, looks good. There are some small backward incompatabilities (any place that returns a Fieldable, like getUniqueKeyField), but it can't be helped, and it's fairly expert level anyway. My only concern was about a memory increase for lazy-loaded short fields. I reviewed some of the LazyField code just now, and it looks like this shouldn't be the case: LazyField is an inner class that contains an extra 3 members. It's outer class that it will retain a reference to is FieldsReader. The fieldsReader instance is a member of SegmentReader, and has the same lifetime as the SegmentReader. Hence the LazyField won't extend the lifetime of any other objects. One thing I did see is the internal char[] buffer used to read the string in LazyField is a member for some reason (hence the data will be stored in the field twice for some reason). I think this is probably a bug, and I'll bring it up on the Lucene list. Ideas for future optimizations: if there is no document cache, change lazy to no-load special cases: if only a single field (like the ID field) is selected out of many documents to be return, consider bypassing doc cache and use LOAD_AND_BREAK if we know there is only a single value.
          Hide
          Yonik Seeley added a comment -

          The one other memory increase I can see from using lazy fields is due to the thread local... a cloned IndexInput (containing a 1K byte buffer + other object overhead). That shouldn't be a big deal since it's related to the number of different threads used to access lazy loaded fields, and not directly to the number of lazy fields themselves.

          In any case, your optimization of retrieving all the fields needed for the request probably prevents many lazy field invocations.

          Show
          Yonik Seeley added a comment - The one other memory increase I can see from using lazy fields is due to the thread local... a cloned IndexInput (containing a 1K byte buffer + other object overhead). That shouldn't be a big deal since it's related to the number of different threads used to access lazy loaded fields, and not directly to the number of lazy fields themselves. In any case, your optimization of retrieving all the fields needed for the request probably prevents many lazy field invocations.
          Hide
          Yonik Seeley added a comment -

          > hence the data will be stored in the field twice for some reason

          FYI, I just checked in a Lucene fix for this.

          Show
          Yonik Seeley added a comment - > hence the data will be stored in the field twice for some reason FYI, I just checked in a Lucene fix for this.
          Hide
          Mike Klaas added a comment -

          updated version of patch. Addresses some of Hoss' (minor) comments. Also, the .doc() method of SolrIndexSearcher will added the unique key field unconditionally if it is present in the schema. IT is used randomly in several places and including checks for it in other places decreases readability.

          Show
          Mike Klaas added a comment - updated version of patch. Addresses some of Hoss' (minor) comments. Also, the .doc() method of SolrIndexSearcher will added the unique key field unconditionally if it is present in the schema. IT is used randomly in several places and including checks for it in other places decreases readability.
          Hide
          Mike Klaas added a comment -

          Note the above patch does not address the issue of lazy field use mismatch between two handlers (see solr-dev)

          Show
          Mike Klaas added a comment - Note the above patch does not address the issue of lazy field use mismatch between two handlers (see solr-dev)
          Hide
          Mike Klaas added a comment -

          Moved id field selection out of SolrIndexSearcher.doc()

          Chris: What would you like to see vis-a-vis the many field issues before committing? Should we put in a global lazy-field-disable option?

          Show
          Mike Klaas added a comment - Moved id field selection out of SolrIndexSearcher.doc() Chris: What would you like to see vis-a-vis the many field issues before committing? Should we put in a global lazy-field-disable option?
          Hide
          Yonik Seeley added a comment -

          FYI, we need a lucene refresh before we use lazy fields because of this:
          http://issues.apache.org/jira/browse/LUCENE-683

          Show
          Yonik Seeley added a comment - FYI, we need a lucene refresh before we use lazy fields because of this: http://issues.apache.org/jira/browse/LUCENE-683
          Hide
          Mike Klaas added a comment -

          This version adds a solrconfig parameter which allows lazy fields to be enabled or disabled (disabled by default).

          Still needs testing after syncing with lucene changes

          Show
          Mike Klaas added a comment - This version adds a solrconfig parameter which allows lazy fields to be enabled or disabled (disabled by default). Still needs testing after syncing with lucene changes
          Hide
          Hoss Man added a comment -

          Mike, how do you feel about these changes so far?

          FWIW: if you still aren't comfortable resolving, you should probably mark this isssue as "In Progress" so it stays on the radar.

          Show
          Hoss Man added a comment - Mike, how do you feel about these changes so far? FWIW: if you still aren't comfortable resolving, you should probably mark this isssue as "In Progress" so it stays on the radar.
          Hide
          Mike Klaas added a comment -

          Thanks for reminding me, Hoss.

          Changes committed in r479793. I will reopen if I discover problems.

          Show
          Mike Klaas added a comment - Thanks for reminding me, Hoss. Changes committed in r479793. I will reopen if I discover problems.
          Hide
          Mike Klaas added a comment -

          committed in r479793

          Show
          Mike Klaas added a comment - committed in r479793
          Hide
          Hoss Man added a comment -

          This bug was modified as part of a bulk update using the criteria...

          • Marked ("Resolved" or "Closed") and "Fixed"
          • Had no "Fix Version" versions
          • Was listed in the CHANGES.txt for 1.1

          The Fix Version for all 38 issues found was set to 1.1, email notification
          was suppressed to prevent excessive email.

          For a list of all the issues modified, search jira comments for this
          (hopefully) unique string: 20080415hossman3

          Show
          Hoss Man added a comment - This bug was modified as part of a bulk update using the criteria... Marked ("Resolved" or "Closed") and "Fixed" Had no "Fix Version" versions Was listed in the CHANGES.txt for 1.1 The Fix Version for all 38 issues found was set to 1.1, email notification was suppressed to prevent excessive email. For a list of all the issues modified, search jira comments for this (hopefully) unique string: 20080415hossman3

            People

            • Assignee:
              Mike Klaas
              Reporter:
              Mike Klaas
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development