Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-8740

set docValues="true" for most non-text fieldTypes in the sample schemas

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None

      Description

      We should consider switching to docValues for most of the non-text fields in the sample schemas provided with solr.

      This may be a better default since it is more NRT friendly and acts to avoid OOM errors due to large field cache or UnInvertedField entries.

      1. SOLR-8740.patch
        21 kB
        Yonik Seeley
      2. SOLR-8740.patch
        6 kB
        Yonik Seeley

        Issue Links

          Activity

          Hide
          tomasflobbe Tomás Fernández Löbbe added a comment -

          +1

          Show
          tomasflobbe Tomás Fernández Löbbe added a comment - +1
          Hide
          jkrupan Jack Krupansky added a comment -

          And default to docValuesFormat="Memory" as well, or is that already the default when docValues="true" is set?

          Personally, I still find the whole docValues vs. Stored fields narrative extremely confusing. I've never been able to figure out why Lucene still needs Stored fields (other than for tokenized text fields) if docValues is so much better.

          In any case, with this Jira in place, there should be clear doc as to what scenarios, if any, stored="true" might have any utility for non-tokenized/text fields.

          Show
          jkrupan Jack Krupansky added a comment - And default to docValuesFormat="Memory" as well, or is that already the default when docValues="true" is set? Personally, I still find the whole docValues vs. Stored fields narrative extremely confusing. I've never been able to figure out why Lucene still needs Stored fields (other than for tokenized text fields) if docValues is so much better. In any case, with this Jira in place, there should be clear doc as to what scenarios, if any, stored="true" might have any utility for non-tokenized/text fields.
          Hide
          varunthacker Varun Thacker added a comment -

          +1 . Too many new users still don't know about docValues and when their index grows large and they start seeing memory pressure issues re-indexing is a big pain.

          Show
          varunthacker Varun Thacker added a comment - +1 . Too many new users still don't know about docValues and when their index grows large and they start seeing memory pressure issues re-indexing is a big pain.
          Hide
          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          In the same breath, now that non-stored docValues fields can be accessed through the docValues api, I think the version field should be non-stored dv going forward. SOLR-6337

          Show
          ichattopadhyaya Ishan Chattopadhyaya added a comment - In the same breath, now that non-stored docValues fields can be accessed through the docValues api, I think the version field should be non-stored dv going forward. SOLR-6337
          Hide
          jpountz Adrien Grand added a comment -

          And default to docValuesFormat="Memory" as well, or is that already the default when docValues="true" is set?

          Having the default setup not using the default codec looks dangerous to me as it means that users won't be able to upgrade clusters without switching back to the default codec first (which is the only supported one for backwards compatibility).

          I've never been able to figure out why Lucene still needs Stored fields (other than for tokenized text fields) if docValues is so much better.

          Doc values are not better, they just have different trade-offs: stored fields are optimized for randomly getting several values from a couple dozen documents while doc values are optimized for sequentially reading a couple values from many documents. If you were to replace stored fields with doc values, performance would become horrible if your index is significantly larger than your filesystem cache, especially if you have spinning disks. I suspect it could be fine if it was only done for the version field as suggested above but doing it for all fields sounds dangerous to me.

          Show
          jpountz Adrien Grand added a comment - And default to docValuesFormat="Memory" as well, or is that already the default when docValues="true" is set? Having the default setup not using the default codec looks dangerous to me as it means that users won't be able to upgrade clusters without switching back to the default codec first (which is the only supported one for backwards compatibility). I've never been able to figure out why Lucene still needs Stored fields (other than for tokenized text fields) if docValues is so much better. Doc values are not better, they just have different trade-offs: stored fields are optimized for randomly getting several values from a couple dozen documents while doc values are optimized for sequentially reading a couple values from many documents. If you were to replace stored fields with doc values, performance would become horrible if your index is significantly larger than your filesystem cache, especially if you have spinning disks. I suspect it could be fine if it was only done for the version field as suggested above but doing it for all fields sounds dangerous to me.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          And default to docValuesFormat="Memory" as well, or is that already the default when docValues="true" is set?

          The ability to keep everything off-heap and accessed via memory map seems to be the default we want (i.e.the docValues default of "disk").
          Sorting and faceting on docValues fields will most likely be a little slower (ignoring NRT).... but the real benefit is having things "just work" and avoiding the dreaded "I got an OOM exception when I tried to sort on this field" issues. So we're going to want them by default on non-text fields that people will use for sorting, faceting, stats, etc.

          I'll try and make some time to tackle this issue soon (sometime after the Lucene/Solr NYC meetup on Wednesday)

          Show
          yseeley@gmail.com Yonik Seeley added a comment - And default to docValuesFormat="Memory" as well, or is that already the default when docValues="true" is set? The ability to keep everything off-heap and accessed via memory map seems to be the default we want (i.e.the docValues default of "disk"). Sorting and faceting on docValues fields will most likely be a little slower (ignoring NRT).... but the real benefit is having things "just work" and avoiding the dreaded "I got an OOM exception when I tried to sort on this field" issues. So we're going to want them by default on non-text fields that people will use for sorting, faceting, stats, etc. I'll try and make some time to tackle this issue soon (sometime after the Lucene/Solr NYC meetup on Wednesday)
          Hide
          simon.rosenthal Simon Rosenthal added a comment -

          If this is adopted, it needs to be clearly documented that DocValues do not retain ordering in multivalued fields whereas stored fields do. Our use case - picking first and last authors from a multivalued 'authors' String field.

          Show
          simon.rosenthal Simon Rosenthal added a comment - If this is adopted, it needs to be clearly documented that DocValues do not retain ordering in multivalued fields whereas stored fields do. Our use case - picking first and last authors from a multivalued 'authors' String field.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          This issue is really only about changing what our starting schemas look like I think, and won't change any behavior. That's already been done in other issues.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - This issue is really only about changing what our starting schemas look like I think, and won't change any behavior. That's already been done in other issues.
          Hide
          varunthacker Varun Thacker added a comment -

          Can we please not mix up the issues.

          The issue here which Yonik created is about making docValues as default. So even if you don't mark a field explicitly with docValues="true" in your schema , Solr will enable it. You have to explicitly turn it off instead. This does not change any behaviour on how fields are returned etc.

          Discussions which started with I still find the whole docValues vs. Stored fields narrative extremely confusing. and any followups on this are tangential to this issue. Solr now supports two way to return back the actual field values. SOLR-8220 has all the details and there is a very detailed entry about it in the CHANGES file. Can we please have any followup discussions on that on another mailing list discussion/SOLR-8220 Jira

          * The Solr schema version has been increased to 1.6. Since schema version 1.6, all non-stored docValues fields
            will be returned along with other stored fields when all fields (or pattern matching globs) are specified
            to be returned (e.g. fl=*) for search queries. This behavior can be turned on and off by setting
            'useDocValuesAsStored' parameter for a field or a field type to true (default since schema version 1.6)
            or false (default till schema version 1.5).
            Note that enabling this property has performance implications because DocValues are column-oriented and may
            therefore incur additional cost to retrieve for each returned document. All example schema are upgraded to
            version 1.6 but any older schemas will default to useDocValuesAsStored=false and continue to work as in
            older versions of Solr. If this new behavior is desirable, then you should set version attribute in your
            schema file to '1.6'. Re-indexing is not necessary to upgrade the schema version.
            Also note that while returning non-stored fields from docValues (default in schema versions 1.6+, unless
            useDocValuesAsStored is false), the values of a multi-valued field are returned in sorted order.
            If you require the multi-valued fields to be returned in the original insertion order, then make your
            multi-valued field as stored. This requires re-indexing.
            See SOLR-8220 for more details.
          
          Show
          varunthacker Varun Thacker added a comment - Can we please not mix up the issues. The issue here which Yonik created is about making docValues as default. So even if you don't mark a field explicitly with docValues="true" in your schema , Solr will enable it. You have to explicitly turn it off instead. This does not change any behaviour on how fields are returned etc. Discussions which started with I still find the whole docValues vs. Stored fields narrative extremely confusing. and any followups on this are tangential to this issue. Solr now supports two way to return back the actual field values. SOLR-8220 has all the details and there is a very detailed entry about it in the CHANGES file. Can we please have any followup discussions on that on another mailing list discussion/ SOLR-8220 Jira * The Solr schema version has been increased to 1.6. Since schema version 1.6, all non-stored docValues fields will be returned along with other stored fields when all fields (or pattern matching globs) are specified to be returned (e.g. fl=*) for search queries. This behavior can be turned on and off by setting 'useDocValuesAsStored' parameter for a field or a field type to true ( default since schema version 1.6) or false ( default till schema version 1.5). Note that enabling this property has performance implications because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. All example schema are upgraded to version 1.6 but any older schemas will default to useDocValuesAsStored= false and continue to work as in older versions of Solr. If this new behavior is desirable, then you should set version attribute in your schema file to '1.6'. Re-indexing is not necessary to upgrade the schema version. Also note that while returning non-stored fields from docValues ( default in schema versions 1.6+, unless useDocValuesAsStored is false ), the values of a multi-valued field are returned in sorted order. If you require the multi-valued fields to be returned in the original insertion order, then make your multi-valued field as stored. This requires re-indexing. See SOLR-8220 for more details.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          The issue here which Yonik created is about making docValues as default. So even if you don't mark a field explicitly with docValues="true" in your schema , Solr will enable it.

          Oh, hmmm, that's a good point. I hadn't actually meant that. But I now realize that the issue title/description is somewhat ambiguous.
          As an example, I was thinking of things like changing the *_i dynamic field in our schema templates to explicitly include docValues="true" by default (or setting it on the fieldType if that's a flag that carries over to all of it's fields by default).

          We could consider changing the default such that even if it doesn't appear in the schema, docValues would be true (and bump the schema version of course). I had assumed that might be more controversial though, so was just looking to effectively change the de-facto defaults.

          If we can set docValues=true on a fieldType and have it carry over to all fields by default, then that's even friendly for fields added via API or guessed fields.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - The issue here which Yonik created is about making docValues as default. So even if you don't mark a field explicitly with docValues="true" in your schema , Solr will enable it. Oh, hmmm, that's a good point. I hadn't actually meant that. But I now realize that the issue title/description is somewhat ambiguous. As an example, I was thinking of things like changing the *_i dynamic field in our schema templates to explicitly include docValues="true" by default (or setting it on the fieldType if that's a flag that carries over to all of it's fields by default). We could consider changing the default such that even if it doesn't appear in the schema, docValues would be true (and bump the schema version of course). I had assumed that might be more controversial though, so was just looking to effectively change the de-facto defaults. If we can set docValues=true on a fieldType and have it carry over to all fields by default, then that's even friendly for fields added via API or guessed fields.
          Hide
          jkrupan Jack Krupansky added a comment -

          My apologies for any unnecessary noise I may have caused here. I just think that every single docValues issue raised for Solr should endeavor to make the lives of Solr users a lot easier, not more complicated and even more confusing. As things stand, docValues is more of an expert-only feature. The mere fact that we can't make docValues uniformly the default illustrates that in spades.

          Show
          jkrupan Jack Krupansky added a comment - My apologies for any unnecessary noise I may have caused here. I just think that every single docValues issue raised for Solr should endeavor to make the lives of Solr users a lot easier, not more complicated and even more confusing. As things stand, docValues is more of an expert-only feature. The mere fact that we can't make docValues uniformly the default illustrates that in spades.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          +1

          I think this ticket is a pretty important improvement because it means that the /export handler (requires docValues) will work out of the box. The /export handler is needed for many features in Streaming Expressions and Paralllel SQL. So having docValues on by default makes these feature easier to use.

          Show
          joel.bernstein Joel Bernstein added a comment - +1 I think this ticket is a pretty important improvement because it means that the /export handler (requires docValues) will work out of the box. The /export handler is needed for many features in Streaming Expressions and Paralllel SQL. So having docValues on by default makes these feature easier to use.
          Hide
          erickerickson Erick Erickson added a comment -

          Tell you what, I'll either write or make sure a test exists that illustrates my concern and in any case put in a big fat warning in the code to document if the change if it suddenly starts failing.

          Show
          erickerickson Erick Erickson added a comment - Tell you what, I'll either write or make sure a test exists that illustrates my concern and in any case put in a big fat warning in the code to document if the change if it suddenly starts failing.
          Hide
          tomasflobbe Tomás Fernández Löbbe added a comment -

          Also, if we add PointFields (SOLR-8396), we'll need docValues for sorting/faceting on them. A basic/unoptimized schema would allow you to sort or facet on any of those fields as you can do now with TrieFields and the FieldCache/FieldValueCache. Later you can optimize the schema by removing DV from the fields where you don't need them. I believe that would be a better starting experience than not adding DV and giving errors.

          I was thinking of things like changing the *_i dynamic field in our schema templates to explicitly include docValues="true" by default (or setting it on the fieldType if that's a flag that carries over to all of it's fields by default).

          We could consider changing the default such that even if it doesn't appear in the schema, docValues would be true (and bump the schema version of course). I had assumed that might be more controversial though, so was just looking to effectively change the de-facto defaults.

          I'm OK either way, I do think in any case we should leave the docValues="true" in the schema to make it more obvious that you want to look at that attribute for optimizing your schema.

          Show
          tomasflobbe Tomás Fernández Löbbe added a comment - Also, if we add PointFields ( SOLR-8396 ), we'll need docValues for sorting/faceting on them. A basic/unoptimized schema would allow you to sort or facet on any of those fields as you can do now with TrieFields and the FieldCache/FieldValueCache. Later you can optimize the schema by removing DV from the fields where you don't need them. I believe that would be a better starting experience than not adding DV and giving errors. I was thinking of things like changing the *_i dynamic field in our schema templates to explicitly include docValues="true" by default (or setting it on the fieldType if that's a flag that carries over to all of it's fields by default). We could consider changing the default such that even if it doesn't appear in the schema, docValues would be true (and bump the schema version of course). I had assumed that might be more controversial though, so was just looking to effectively change the de-facto defaults. I'm OK either way, I do think in any case we should leave the docValues="true" in the schema to make it more obvious that you want to look at that attribute for optimizing your schema.
          Hide
          arafalov Alexandre Rafalovitch added a comment -

          We could consider changing the default such that even if it doesn't appear in the schema, docValues would be true (and bump the schema version of course). I had assumed that might be more controversial though, so was just looking to effectively change the de-facto defaults.

          At the moment, both indexed and stored flags are false if not set, I believe. making docValues to be true if not set, could cause confusions. So, I am with Yonik's original suggestion of having them enabled as explicit flags in the example schemas.

          Show
          arafalov Alexandre Rafalovitch added a comment - We could consider changing the default such that even if it doesn't appear in the schema, docValues would be true (and bump the schema version of course). I had assumed that might be more controversial though, so was just looking to effectively change the de-facto defaults. At the moment, both indexed and stored flags are false if not set, I believe. making docValues to be true if not set, could cause confusions. So, I am with Yonik's original suggestion of having them enabled as explicit flags in the example schemas.
          Hide
          markrmiller@gmail.com Mark Miller added a comment -

          There are tradeoffs. They won't be shoved on users though. They can always switch back. Given the history of users with OOM and the fieldcache, I think we are making a much better user experience default.

          Show
          markrmiller@gmail.com Mark Miller added a comment - There are tradeoffs. They won't be shoved on users though. They can always switch back. Given the history of users with OOM and the fieldcache, I think we are making a much better user experience default.
          Hide
          elyograg Shawn Heisey added a comment -

          At the moment, both indexed and stored flags are false if not set, I believe.

          The default values for stored and indexed are true for all schema versions. See this method in FieldType.java, line 153 in particular:

          https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=solr/core/src/java/org/apache/solr/schema/FieldType.java;h=fbcebcda1b163dd1ba2d9980e1cfd540b2769f13;hb=12f7ad66963a5ae784f2bd0bf8b5dbc4b3c1630e#l150

          I think setting the default for docValues to true in schema version 1.7 is probably a good idea, but I do predict a lot of "I upgraded and now my index is twice as big!" messages on the list.

          Show
          elyograg Shawn Heisey added a comment - At the moment, both indexed and stored flags are false if not set, I believe. The default values for stored and indexed are true for all schema versions. See this method in FieldType.java, line 153 in particular: https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=solr/core/src/java/org/apache/solr/schema/FieldType.java;h=fbcebcda1b163dd1ba2d9980e1cfd540b2769f13;hb=12f7ad66963a5ae784f2bd0bf8b5dbc4b3c1630e#l150 I think setting the default for docValues to true in schema version 1.7 is probably a good idea, but I do predict a lot of "I upgraded and now my index is twice as big!" messages on the list.
          Hide
          elyograg Shawn Heisey added a comment -

          On further consideration, if the schema version is explicitly stated, I guess that won't happen.

          Show
          elyograg Shawn Heisey added a comment - On further consideration, if the schema version is explicitly stated, I guess that won't happen.
          Hide
          erickerickson Erick Erickson added a comment -

          I added some tests that should serve to flag if the ordering of MV fields is different so we can stop discussing it See SOLR-8813

          Show
          erickerickson Erick Erickson added a comment - I added some tests that should serve to flag if the ordering of MV fields is different so we can stop discussing it See SOLR-8813
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Here's a draft patch for "basic" schema only at this point...

          • boolean fields don't support docValues
          • I don't think LatLonType supports docValues in the main field
          • Not sure if some of the other spatial types (bbox) support DocValues for the main field or not. But perhaps these are on the road to deprecation due to the new point stuff in lucene?
          • in some preliminary performance tests with unstored DV fields, /export was much faster (7 times faster) than /select - so at this point I've left the fields as stored as well until performance issues can be addressed.
          Show
          yseeley@gmail.com Yonik Seeley added a comment - Here's a draft patch for "basic" schema only at this point... boolean fields don't support docValues I don't think LatLonType supports docValues in the main field Not sure if some of the other spatial types (bbox) support DocValues for the main field or not. But perhaps these are on the road to deprecation due to the new point stuff in lucene? in some preliminary performance tests with unstored DV fields, /export was much faster (7 times faster) than /select - so at this point I've left the fields as stored as well until performance issues can be addressed.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Here's an update w/ the other template schemas converted to use docValues.
          I'll do some more testing to validate that things work and then commit.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Here's an update w/ the other template schemas converted to use docValues. I'll do some more testing to validate that things work and then commit.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit e76fa568172173feeed3eaaf7de06b773b32605d in lucene-solr's branch refs/heads/master from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e76fa56 ]

          SOLR-8740: use docValues for non-text fields in schema templates

          Show
          jira-bot ASF subversion and git services added a comment - Commit e76fa568172173feeed3eaaf7de06b773b32605d in lucene-solr's branch refs/heads/master from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e76fa56 ] SOLR-8740 : use docValues for non-text fields in schema templates
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 14752476f445436944618a6f1dde9bd787a1f3c9 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1475247 ]

          SOLR-8740: use docValues for non-text fields in schema templates

          Show
          jira-bot ASF subversion and git services added a comment - Commit 14752476f445436944618a6f1dde9bd787a1f3c9 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1475247 ] SOLR-8740 : use docValues for non-text fields in schema templates
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 97361313582d054ff44a2a3a8e206f313e67d68c in lucene-solr's branch refs/heads/branch_6_0 from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9736131 ]

          SOLR-8740: use docValues for non-text fields in schema templates

          Show
          jira-bot ASF subversion and git services added a comment - Commit 97361313582d054ff44a2a3a8e206f313e67d68c in lucene-solr's branch refs/heads/branch_6_0 from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9736131 ] SOLR-8740 : use docValues for non-text fields in schema templates
          Hide
          hossman Hoss Man added a comment -

          updating summary & description to reflect the reality of what was changed....

          For fields & fieldtypes that do not explicitly declare docValues attribute, all existing solr FieldTypes still default to false, but in all of the same schema files provided with solr most non-text <fieldType /> declarations have been modified to include a docValues="true" attribute

          Show
          hossman Hoss Man added a comment - updating summary & description to reflect the reality of what was changed.... For fields & fieldtypes that do not explicitly declare docValues attribute, all existing solr FieldTypes still default to false, but in all of the same schema files provided with solr most non-text <fieldType /> declarations have been modified to include a docValues="true" attribute
          Hide
          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          Yonik Seeley, in the Solr 6.0 release notes, the following was mentioned:

          Users should set useDocValuesAsStored="false" to preserve sort order on multi-valued
              fields that have both stored="true" and docValues="true". 
          

          Given that useDocValuesAsStored doesn't currently affect fields with stored=true, do you think this note was appropriately worded? I agree that in future, we might have useDocValuesAsStored affect how we retrieve stored=true fields, but whether that will affect the ordering is still not known (since we can, in theory, use another docValues format that preserves the original order). Though, even if future optimizations are potentially breaking for a user, the note mentioned above implies something that needs to be done for avoiding a current issue, not a future issue.

          Show
          ichattopadhyaya Ishan Chattopadhyaya added a comment - Yonik Seeley , in the Solr 6.0 release notes, the following was mentioned: Users should set useDocValuesAsStored= " false " to preserve sort order on multi-valued fields that have both stored= " true " and docValues= " true " . Given that useDocValuesAsStored doesn't currently affect fields with stored=true, do you think this note was appropriately worded? I agree that in future, we might have useDocValuesAsStored affect how we retrieve stored=true fields, but whether that will affect the ordering is still not known (since we can, in theory, use another docValues format that preserves the original order). Though, even if future optimizations are potentially breaking for a user, the note mentioned above implies something that needs to be done for avoiding a current issue, not a future issue.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          I think it's a good idea... there is a lot of performance to be gained out of using docValues when returning a couple of docValue fields for many documents (like the export handler currently does). Although we can (and should at some point) have docValue implementations that preserve order, we don't have them yet... and the note leaves the door open to implementing optimizations before the next major release.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - I think it's a good idea... there is a lot of performance to be gained out of using docValues when returning a couple of docValue fields for many documents (like the export handler currently does). Although we can (and should at some point) have docValue implementations that preserve order, we don't have them yet... and the note leaves the door open to implementing optimizations before the next major release.

            People

            • Assignee:
              Unassigned
              Reporter:
              yseeley@gmail.com Yonik Seeley
            • Votes:
              3 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development