Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10286

Declare a field as "large", don't keep value in the document cache

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.5, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      (part of umbrella issue SOLR-10117)
      This adds a field to be declared as "large" in the schema. In the SolrIndexSearcher.doc(...) handling, these fields are lazily fetched from Lucene. Unlike LazyDocument.LazyField, it's not cached after first-use unless the value is "small" < 512KB by default. "large" can only be used when its stored="true" and multiValued="false" and the field is otherwise compatible (basically not a numeric field) – you'll get a helpful exception if it's unsupported. BinaryField is not yet supported at this time; it could be in the future.

      1. SOLR-10286_large_fields.patch
        43 kB
        David Smiley
      2. SOLR-10286_large_fields.patch
        43 kB
        David Smiley
      3. tests-failures.txt
        1.93 MB
        Yonik Seeley

        Issue Links

          Activity

          Hide
          dsmiley David Smiley added a comment -

          Here's a patch. The whole test suite has passed twice now.

          Nocommits:

          • SchemaField.isLarge will default to true for the purposes of testing during development of this feature. This was extremely useful. It should of course be false.
          • SolrIndexSearcher: I want to refactor/move all the Lucene Document related code and docValuesAsStored type code out of here into a new companion class named SolrDocumentFetcher. I didn't do that yet as I want the patch to show what's changed more clearly.
          • BaseEditorialTransformer (related to query elevation): This made bad instanceof assumptions that I fixed, but I found the code to be too loosy goosey to my liking on toString'ing whatever it didn't understand (I hate that in general; leads to hard-to-find bugs). I don't think it could happen so I added an "assert false". Now that I see all tests pass, I'm inclined to make it fail hard.

          Schema package:

          • FieldProperties: converted the bit masks from hex to instead use Java 8's boolean literal. Much clearer!
          • Question: Yonik Seeley what is BINARY for? This isn't used anywhere and the line of code dates back to Solr's initial Apache contribution. For a moment I thought I could use it as the same as a BinaryField check but apparently not.
          • FieldType.checkSchemaField only used to test for docValues compatibility and subclasses would override this to add a no-op. I think that design was poor as it's too all-encompassing, so I made it call a new checkSupportsDocValues() and had the applicable subclasses override that instead.
          • FieldType.checkSchemaField now checks for "large" compatibility – multiValued, stored, not-a-number. BinaryField overrides to throw as well as that hasn't been implemented yet.

          SolrIndexSearcher:

          • I refactored the doc() handling to always use a custom StoredFieldVisitor, which I think makes it clearer. This may also make it easier to add a Status.STOP optimization for single-valued fields but I didn't get to that.
          • When the Unified/Postings highlighters supply their custom StoredFieldVisitor and match an already cached document's large field, this code will avoid a double-string conversion, reducing heap memory pressure.

          Tests:

          • The test is pretty basic; good enough? It'd be nice to add a test to the Solr UnifiedHighlighter related stuff to randomly use this field. It's at least an opt-in feature so I'm not too worried... not to mention I ran this with a default large'ness to tease out bugs. I wonder if the default large'ness could/should be flipped randomly by Solr's test infrastructure?

          Bugs found/fixed:

          • In a couple places in Solr, there was an assumption that the Lucene IndexableField was actually an instance of Field. Two cases are seen as fixed in this patch:
            • DocumentBuilder.addField. It appears in-place updates might not have worked in some cases involving lazy fields, depending on the usage pattern.
            • BaseEditorialTransformer (query elevation).
          • RealTimeGetComponent: RTG can internally grab a ref-counted realtime searcher, lookup a document, then dec-ref the searcher. If the searcher is subsequently closed, the lazy field can't get the value anymore. Theoretically this problem could happen with Solr's standard lazy fields too but a "large" field is better at provoking it. I fixed this by essentially copying the IndexableField. It'd be nice if Lucene Field had a copy-constructor of an IndexableField; I was forced to subclass to accomplish the same.

          Although not a strict requirement, ideally SOLR-10273 (largest field last) is also done.

          Show
          dsmiley David Smiley added a comment - Here's a patch. The whole test suite has passed twice now. Nocommits: SchemaField.isLarge will default to true for the purposes of testing during development of this feature. This was extremely useful. It should of course be false. SolrIndexSearcher: I want to refactor/move all the Lucene Document related code and docValuesAsStored type code out of here into a new companion class named SolrDocumentFetcher . I didn't do that yet as I want the patch to show what's changed more clearly. BaseEditorialTransformer (related to query elevation): This made bad instanceof assumptions that I fixed, but I found the code to be too loosy goosey to my liking on toString'ing whatever it didn't understand (I hate that in general; leads to hard-to-find bugs). I don't think it could happen so I added an "assert false". Now that I see all tests pass, I'm inclined to make it fail hard. Schema package: FieldProperties: converted the bit masks from hex to instead use Java 8's boolean literal. Much clearer! Question: Yonik Seeley what is BINARY for? This isn't used anywhere and the line of code dates back to Solr's initial Apache contribution. For a moment I thought I could use it as the same as a BinaryField check but apparently not. FieldType.checkSchemaField only used to test for docValues compatibility and subclasses would override this to add a no-op. I think that design was poor as it's too all-encompassing, so I made it call a new checkSupportsDocValues() and had the applicable subclasses override that instead. FieldType.checkSchemaField now checks for "large" compatibility – multiValued, stored, not-a-number. BinaryField overrides to throw as well as that hasn't been implemented yet. SolrIndexSearcher: I refactored the doc() handling to always use a custom StoredFieldVisitor, which I think makes it clearer. This may also make it easier to add a Status.STOP optimization for single-valued fields but I didn't get to that. When the Unified/Postings highlighters supply their custom StoredFieldVisitor and match an already cached document's large field, this code will avoid a double-string conversion, reducing heap memory pressure. Tests: The test is pretty basic; good enough? It'd be nice to add a test to the Solr UnifiedHighlighter related stuff to randomly use this field. It's at least an opt-in feature so I'm not too worried... not to mention I ran this with a default large'ness to tease out bugs. I wonder if the default large'ness could/should be flipped randomly by Solr's test infrastructure? Bugs found/fixed: In a couple places in Solr, there was an assumption that the Lucene IndexableField was actually an instance of Field . Two cases are seen as fixed in this patch: DocumentBuilder.addField . It appears in-place updates might not have worked in some cases involving lazy fields, depending on the usage pattern. BaseEditorialTransformer (query elevation). RealTimeGetComponent: RTG can internally grab a ref-counted realtime searcher, lookup a document, then dec-ref the searcher. If the searcher is subsequently closed, the lazy field can't get the value anymore. Theoretically this problem could happen with Solr's standard lazy fields too but a "large" field is better at provoking it. I fixed this by essentially copying the IndexableField. It'd be nice if Lucene Field had a copy-constructor of an IndexableField; I was forced to subclass to accomplish the same. Although not a strict requirement, ideally SOLR-10273 (largest field last) is also done.
          Hide
          dsmiley David Smiley added a comment -

          One small change I should do is to have the SolrIndexSearcher.doc code only do it's large field handling if either there is a DocumentCache, or if lazy field loading is enabled. If neither are true, there's no point to the special LargeLazyField.

          One idea I rejected early (in parent issue SOLR-10117) I should mention here is that it's not realistic to use the LargeLazyField dynamically based on the actual value size. If the value is huge, then it's effectively too late – you've already born the cost of fetching it (to observe its size) and putting it on the heap. So we have to say we're going to do this always for certain fields. SOLR-10255 included an idea of putting a special marker stored value to give this code a heads-up that certain values on a particular document are large, thus allowing the decision to be made dynamically. That could be added in the future still.

          Show
          dsmiley David Smiley added a comment - One small change I should do is to have the SolrIndexSearcher.doc code only do it's large field handling if either there is a DocumentCache, or if lazy field loading is enabled. If neither are true, there's no point to the special LargeLazyField. One idea I rejected early (in parent issue SOLR-10117 ) I should mention here is that it's not realistic to use the LargeLazyField dynamically based on the actual value size. If the value is huge, then it's effectively too late – you've already born the cost of fetching it (to observe its size) and putting it on the heap. So we have to say we're going to do this always for certain fields. SOLR-10255 included an idea of putting a special marker stored value to give this code a heads-up that certain values on a particular document are large, thus allowing the decision to be made dynamically. That could be added in the future still.
          Hide
          dsmiley David Smiley added a comment -

          I filed SOLR-10304 for refactoring out a SolrDocumentFetcher from SolrIndexSearcher. A large-ish refactoring like that should really should be it's own commit separate from the changes here.

          New patch. Resolves the nocommits. I enhanced TestUnifiedSolrHighlighter to test a "large" field, which I verified by setting a breakpoint and I watched it get hit. I added the check to skip large field processing when there is no document cache.

          I plan to commit this tomorrow ~ noon EST.

          Show
          dsmiley David Smiley added a comment - I filed SOLR-10304 for refactoring out a SolrDocumentFetcher from SolrIndexSearcher. A large-ish refactoring like that should really should be it's own commit separate from the changes here. New patch. Resolves the nocommits. I enhanced TestUnifiedSolrHighlighter to test a "large" field, which I verified by setting a breakpoint and I watched it get hit. I added the check to skip large field processing when there is no document cache. I plan to commit this tomorrow ~ noon EST.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 2502af9f3fa25a1b724400af61bf74102f2475dd in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2502af9 ]

          SOLR-10286: large fields.
          And refactored FieldType.checkSchemaField to call a new checkSupportsDocValues()

          Show
          jira-bot ASF subversion and git services added a comment - Commit 2502af9f3fa25a1b724400af61bf74102f2475dd in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2502af9 ] SOLR-10286 : large fields. And refactored FieldType.checkSchemaField to call a new checkSupportsDocValues()
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 99fc669f8f17846a20b74e09212fb9c02bc69581 in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=99fc669 ]

          SOLR-10286: large fields.
          And refactored FieldType.checkSchemaField to call a new checkSupportsDocValues()

          (cherry picked from commit 2502af9)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 99fc669f8f17846a20b74e09212fb9c02bc69581 in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=99fc669 ] SOLR-10286 : large fields. And refactored FieldType.checkSchemaField to call a new checkSupportsDocValues() (cherry picked from commit 2502af9)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4ee7fc38907a94f025785ebd388dd372b260913d in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4ee7fc3 ]

          SOLR-10286: fix test; we were writing to read-only dir.
          Expand solrconfig-managed-schema.xml to have toggle-able elements vis system property flags

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4ee7fc38907a94f025785ebd388dd372b260913d in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4ee7fc3 ] SOLR-10286 : fix test; we were writing to read-only dir. Expand solrconfig-managed-schema.xml to have toggle-able elements vis system property flags
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4628cf478373faa6116bf6e41e04337d71db2e9d in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4628cf4 ]

          SOLR-10286: fix test; we were writing to read-only dir.
          Expand solrconfig-managed-schema.xml to have toggle-able elements vis system property flags

          (cherry picked from commit 4ee7fc3)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4628cf478373faa6116bf6e41e04337d71db2e9d in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4628cf4 ] SOLR-10286 : fix test; we were writing to read-only dir. Expand solrconfig-managed-schema.xml to have toggle-able elements vis system property flags (cherry picked from commit 4ee7fc3)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 65c163c3e3b9664c7309085b7d2271888bb6d163 in lucene-solr's branch refs/heads/branch_6x from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=65c163c ]

          SOLR-10286: fix precommit

          Show
          jira-bot ASF subversion and git services added a comment - Commit 65c163c3e3b9664c7309085b7d2271888bb6d163 in lucene-solr's branch refs/heads/branch_6x from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=65c163c ] SOLR-10286 : fix precommit
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4a55bc4e0f7a5b227f774fa3d7bbf4f1a4767eb1 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4a55bc4 ]

          SOLR-10286: fix precommit (unused imports)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4a55bc4e0f7a5b227f774fa3d7bbf4f1a4767eb1 in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4a55bc4 ] SOLR-10286 : fix precommit (unused imports)
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -
          Show
          yseeley@gmail.com Yonik Seeley added a comment - This test just failed for me. Seems like it fails for jenkins sometimes also: https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Windows/785/testReport/junit.framework/TestSuite/org_apache_solr_search_LargeFieldTest/
          Hide
          dsmiley David Smiley added a comment -

          Thanks for bringing this to my attention. Hmm.... a quick try from my IDE on my Mac with this seed failed to reproduce. I'll try from the CLI. This failure on Jenkins was a Windows machine and that seems relevant. Is your machine also Windows Yonik? Judging from the stack trace, it appears the solution could be as simple as removing the '?' in the 2nd argument to initCore.... I'm investigating.

          Show
          dsmiley David Smiley added a comment - Thanks for bringing this to my attention. Hmm.... a quick try from my IDE on my Mac with this seed failed to reproduce. I'll try from the CLI. This failure on Jenkins was a Windows machine and that seems relevant. Is your machine also Windows Yonik? Judging from the stack trace, it appears the solution could be as simple as removing the '?' in the 2nd argument to initCore.... I'm investigating.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Is your machine also Windows Yonik?

          Nope, it was Ubuntu 16.04
          I'll see if the logs for that failure are still around...

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Is your machine also Windows Yonik? Nope, it was Ubuntu 16.04 I'll see if the logs for that failure are still around...
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit e7d9db9d6c4dccc158b52d53584ead93b7f55c38 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e7d9db9 ]

          SOLR-10286: fix test for Windows

          Show
          jira-bot ASF subversion and git services added a comment - Commit e7d9db9d6c4dccc158b52d53584ead93b7f55c38 in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e7d9db9 ] SOLR-10286 : fix test for Windows
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 94a3066dcda0a236a761c28930fbade62b55e561 in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=94a3066 ]

          SOLR-10286: fix test for Windows

          (cherry picked from commit e7d9db9)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 94a3066dcda0a236a761c28930fbade62b55e561 in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=94a3066 ] SOLR-10286 : fix test for Windows (cherry picked from commit e7d9db9)
          Hide
          dsmiley David Smiley added a comment -

          Marking closed towards v6.5. If there are bugs/tests that show a problem then new issue(s) need to be filed.

          Show
          dsmiley David Smiley added a comment - Marking closed towards v6.5. If there are bugs/tests that show a problem then new issue(s) need to be filed.

            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              dsmiley David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development