Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10273

Re-order largest field values last in Lucene Document

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.5
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      (part of umbrella issue SOLR-10117)
      In Solr's DocumentBuilder, at the very end, we should move the field value(s) associated with the largest field (assuming "stored") to be last. Lucene's default stored value codec can avoid reading and decompressing the last field value when it's not requested. (As of LUCENE-6898).

        Issue Links

          Activity

          Hide
          dsmiley David Smiley added a comment -

          Here's a patch with test. It also incorporates a minimum long length of 1024 that is overrideable via solr.docBuilder.minLengthToMoveLast (an internal unsupported setting). If no field is at least that long then it won't bother moving it. In tests temporarily I set this to 0 and no existing test broke, which is a good sign. If someone were to observe the ordering that fields come back from Solr with a '*', they might notice a change in ordering after this. Of course people shouldn't depend on inter-field ordering.

          Show
          dsmiley David Smiley added a comment - Here's a patch with test. It also incorporates a minimum long length of 1024 that is overrideable via solr.docBuilder.minLengthToMoveLast (an internal unsupported setting). If no field is at least that long then it won't bother moving it. In tests temporarily I set this to 0 and no existing test broke, which is a good sign. If someone were to observe the ordering that fields come back from Solr with a '*', they might notice a change in ordering after this. Of course people shouldn't depend on inter-field ordering.
          Hide
          mbraun688 Michael Braun added a comment -

          In LUCENE-6898 a comment says it doesn't have an impact if the last stored value is under 16K - should the value be higher than 1024 by default?

          Show
          mbraun688 Michael Braun added a comment - In LUCENE-6898 a comment says it doesn't have an impact if the last stored value is under 16K - should the value be higher than 1024 by default?
          Hide
          dsmiley David Smiley added a comment -

          True; it's debatable... I nearly added a comment about being inclined to raise this min length to something higher so I'm glad you brought it up. That Lucene side value might change in the future or based on a user-chosen codec; we needn't track it exactly. Also, just because the longest field is 1024 doesn't mean the document overall is "small" because theoretically there could be a ton of stored values instead of one particularly large one. Perhaps change to 4KB default? Shrug.

          Show
          dsmiley David Smiley added a comment - True; it's debatable... I nearly added a comment about being inclined to raise this min length to something higher so I'm glad you brought it up. That Lucene side value might change in the future or based on a user-chosen codec; we needn't track it exactly. Also, just because the longest field is 1024 doesn't mean the document overall is "small" because theoretically there could be a ton of stored values instead of one particularly large one. Perhaps change to 4KB default? Shrug.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Is there a way to check this while building the Document from the SolrInputDocument instead (may be cheaper?)
          For multi-valued fields, perhaps we should be using the sum of the multiple fields?
          As a generalization we could also consider sorting by size, not just picking out the largest single field.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Is there a way to check this while building the Document from the SolrInputDocument instead (may be cheaper?) For multi-valued fields, perhaps we should be using the sum of the multiple fields? As a generalization we could also consider sorting by size, not just picking out the largest single field.
          Hide
          dsmiley David Smiley added a comment -

          Is there a way to check this while building the Document from the SolrInputDocument instead (may be cheaper?)

          I briefly contemplated having DocumentBuilder internally collect the Lucene IndexableField instances into some other internal Doc-like inner class that could maintain the largest value as it goes. But that seems over-engineered, and the post-process scanning code later seems pretty quick to me.

          For multi-valued fields, perhaps we should be using the sum of the multiple fields? As a generalization we could also consider sorting by size, not just picking out the largest single field.

          In the entire Lucene+Solr codebase, the only place where StoredFieldVisitor.Status.STOP is actually used is the Unified/Postings highlighters, and only when one field is being highlighted. So if there was an overall large document (>16KB), and if we didn't move the 2nd largest value to the end, and if you wanted to highlight on this 2nd largest value alone, and if there were some additional sizable fields inbetween this 2nd largest value and the last one.... then yes we're doing more work. I don't think it's worth bothering with right now?

          BTW when I commit this patch, I'll change the min size threshold to 4KB

          Show
          dsmiley David Smiley added a comment - Is there a way to check this while building the Document from the SolrInputDocument instead (may be cheaper?) I briefly contemplated having DocumentBuilder internally collect the Lucene IndexableField instances into some other internal Doc-like inner class that could maintain the largest value as it goes. But that seems over-engineered, and the post-process scanning code later seems pretty quick to me. For multi-valued fields, perhaps we should be using the sum of the multiple fields? As a generalization we could also consider sorting by size, not just picking out the largest single field. In the entire Lucene+Solr codebase, the only place where StoredFieldVisitor.Status.STOP is actually used is the Unified/Postings highlighters, and only when one field is being highlighted. So if there was an overall large document (>16KB), and if we didn't move the 2nd largest value to the end, and if you wanted to highlight on this 2nd largest value alone, and if there were some additional sizable fields inbetween this 2nd largest value and the last one.... then yes we're doing more work. I don't think it's worth bothering with right now? BTW when I commit this patch, I'll change the min size threshold to 4KB
          Hide
          rcmuir Robert Muir added a comment -

          This is a big deoptimization for the common case... Lucene must preserve field order and you lose compression if its inconsistent across docs. Also bulk merging relies upon field number consistency across segments. IW tries to keep them aligned but this patch os intentionally being adversarial... To benefit what is a rare and esoteric case.

          This kind of thing should be only enabled by an option.

          Show
          rcmuir Robert Muir added a comment - This is a big deoptimization for the common case... Lucene must preserve field order and you lose compression if its inconsistent across docs. Also bulk merging relies upon field number consistency across segments. IW tries to keep them aligned but this patch os intentionally being adversarial... To benefit what is a rare and esoteric case. This kind of thing should be only enabled by an option.
          Hide
          dsmiley David Smiley added a comment -

          Thanks for alerting me to this Rob! Is there a size threshold at which you think it's not a de-optimization – perhaps the 16KB mark? I suppose your point is consistency... so if we always move the values for certain fields last then there's no problem?

          Also bulk merging relies upon field number consistency across segments

          Can you point me to a line of code in CompressingStoredFieldsWriter that is pertinent? I don't see it.

          Show
          dsmiley David Smiley added a comment - Thanks for alerting me to this Rob! Is there a size threshold at which you think it's not a de-optimization – perhaps the 16KB mark? I suppose your point is consistency... so if we always move the values for certain fields last then there's no problem? Also bulk merging relies upon field number consistency across segments Can you point me to a line of code in CompressingStoredFieldsWriter that is pertinent? I don't see it.
          Hide
          dsmiley David Smiley added a comment -

          Michael McCandless might you know what Rob is referring to? I'd like to see where this happens in Lucene so I can learn more about it. I've been looking around a bit, like in SegmentMerger.

          If there's really an issue here, I could modify the patch to ignore field sizes and instead look for only the fields declared as "large" (SOLR-10286).

          Show
          dsmiley David Smiley added a comment - Michael McCandless might you know what Rob is referring to? I'd like to see where this happens in Lucene so I can learn more about it. I've been looking around a bit, like in SegmentMerger. If there's really an issue here, I could modify the patch to ignore field sizes and instead look for only the fields declared as "large" ( SOLR-10286 ).
          Hide
          mikemccand Michael McCandless added a comment -

          David Smiley have a look @ oal.codecs.compressing.MatchingReaders ... that's where we compute whether the field numbers are congruent across segments so bulk merge can apply (or not).

          Show
          mikemccand Michael McCandless added a comment - David Smiley have a look @ oal.codecs.compressing.MatchingReaders ... that's where we compute whether the field numbers are congruent across segments so bulk merge can apply (or not).
          Hide
          dsmiley David Smiley added a comment -

          Thanks for the pointer Michael McCandless! Wow... it seems index/merge performance could vary quite a bit based on something that seems to me very fragile (not considering large/small fields here; just in general). Why doesn't Lucene sort the FieldInfos such that the field number ascends as the field name alphabetically ascends? I can file an issue if you think that would be a net benefit. Even with that, it's a shame a bulk merge can't happen if some fields simply aren't present in some segments yet are in others. Perhaps again, Lucene could be improved to look across the segments and add all FieldInfo(s) to the segment being written that are in others but not the current one? Perhaps not if doing so would add > X FieldInfos. I have not looked at this part of Lucene in-depth. I suspect that the reason these ideas have yet to be implemented may be because FieldInfo(s) need to be generated in advance of knowing all those that may need to exist.

          Show
          dsmiley David Smiley added a comment - Thanks for the pointer Michael McCandless ! Wow... it seems index/merge performance could vary quite a bit based on something that seems to me very fragile (not considering large/small fields here; just in general). Why doesn't Lucene sort the FieldInfos such that the field number ascends as the field name alphabetically ascends? I can file an issue if you think that would be a net benefit. Even with that, it's a shame a bulk merge can't happen if some fields simply aren't present in some segments yet are in others. Perhaps again, Lucene could be improved to look across the segments and add all FieldInfo(s) to the segment being written that are in others but not the current one? Perhaps not if doing so would add > X FieldInfos. I have not looked at this part of Lucene in-depth. I suspect that the reason these ideas have yet to be implemented may be because FieldInfo(s) need to be generated in advance of knowing all those that may need to exist.
          Hide
          dsmiley David Smiley added a comment -

          I'm looking at IndexWriter.globalFieldNumberMap which is initialized from the current segments. This is then referenced by DocumentsWriter for each DocumentsWriterPerThread. It seems this isn't so fragile after all, with respect to ordering?

          Show
          dsmiley David Smiley added a comment - I'm looking at IndexWriter.globalFieldNumberMap which is initialized from the current segments. This is then referenced by DocumentsWriter for each DocumentsWriterPerThread. It seems this isn't so fragile after all, with respect to ordering?
          Hide
          mikemccand Michael McCandless added a comment -

          Hi David Smiley, yes, you are right: IndexWriter now tries very hard to use consistent field numbers using its global field number map. This isn't always possible, e.g. addIndexes(Directory[]) can bring in inconsistent numbers, and so the matching logic in the bulk merging is still necessary, but I think it should be safe for you to re-order the stored fields here.

          Show
          mikemccand Michael McCandless added a comment - Hi David Smiley , yes, you are right: IndexWriter now tries very hard to use consistent field numbers using its global field number map. This isn't always possible, e.g. addIndexes(Directory[]) can bring in inconsistent numbers, and so the matching logic in the bulk merging is still necessary, but I think it should be safe for you to re-order the stored fields here.
          Hide
          dsmiley David Smiley added a comment -

          Wonderful, then I'll commit this patch (with the 4k threshold) in a few hours or so.

          Show
          dsmiley David Smiley added a comment - Wonderful, then I'll commit this patch (with the 4k threshold) in a few hours or so.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 8fbd9f1e403cc697f77d827cd1aa85876c8665ae in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8fbd9f1 ]

          SOLR-10273: DocumentBuilder move longest field to last position

          Show
          jira-bot ASF subversion and git services added a comment - Commit 8fbd9f1e403cc697f77d827cd1aa85876c8665ae in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8fbd9f1 ] SOLR-10273 : DocumentBuilder move longest field to last position
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 993003b33e33ba78c66ffda41acb12b8239c359a in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=993003b ]

          SOLR-10273: DocumentBuilder move longest field to last position

          (cherry picked from commit 8fbd9f1)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 993003b33e33ba78c66ffda41acb12b8239c359a in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=993003b ] SOLR-10273 : DocumentBuilder move longest field to last position (cherry picked from commit 8fbd9f1)
          Hide
          mbraun688 Michael Braun added a comment -

          Saw this was reopened, is this not fully implemented in the 6.5.1 RC?

          Show
          mbraun688 Michael Braun added a comment - Saw this was reopened, is this not fully implemented in the 6.5.1 RC?
          Hide
          mbraun688 Michael Braun added a comment -

          Please ignore my comment - wrong ticket!

          Show
          mbraun688 Michael Braun added a comment - Please ignore my comment - wrong ticket!

            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              dsmiley David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development