Solr
  1. Solr
  2. SOLR-8220

Read field from docValues for non stored fields

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      Many times a value will be both stored="true" and docValues="true" which requires redundant data to be stored on disk. Since reading from docValues is both efficient and a common practice (facets, analytics, streaming, etc), reading values from docValues when a stored version of the field does not exist would be a valuable disk usage optimization.

      The only caveat with this that I can see would be for multiValued fields as they would always be returned sorted in the docValues approach. I believe this is a fair compromise.

      I've done a rough implementation for this as a field transform, but I think it should live closer to where stored fields are loaded in the SolrIndexSearcher.

      Two open questions/observations:

      1) There doesn't seem to be a standard way to read values for docValues, facets, analytics, streaming, etc, all seem to be doing their own ways, perhaps some of this logic should be centralized.

      2) What will the API behavior be? (Below is my proposed implementation)
      Parameters for fl:

      • fl="docValueField"
        • return field from docValue if the field is not stored and in docValues, if the field is stored return it from stored fields
      • fl="*"
        • return only stored fields
      • fl="+"
        • return stored fields and docValue fields

      2a - would be easiest implementation and might be sufficient for a first pass. 2b - is current behavior

      1. SOLR-8220.patch
        110 kB
        Shalin Shekhar Mangar
      2. SOLR-8220.patch
        111 kB
        Shalin Shekhar Mangar
      3. SOLR-8220.patch
        104 kB
        Ishan Chattopadhyaya
      4. SOLR-8220.patch
        104 kB
        Ishan Chattopadhyaya
      5. SOLR-8220.patch
        107 kB
        Shalin Shekhar Mangar
      6. SOLR-8220.patch
        100 kB
        Ishan Chattopadhyaya
      7. SOLR-8220.patch
        97 kB
        Shalin Shekhar Mangar
      8. SOLR-8220.patch
        97 kB
        Shalin Shekhar Mangar
      9. SOLR-8220.patch
        97 kB
        Shalin Shekhar Mangar
      10. SOLR-8220.patch
        82 kB
        Ishan Chattopadhyaya
      11. SOLR-8220.patch
        81 kB
        Ishan Chattopadhyaya
      12. SOLR-8220.patch
        82 kB
        Ishan Chattopadhyaya
      13. SOLR-8220.patch
        75 kB
        Ishan Chattopadhyaya
      14. SOLR-8220.patch
        73 kB
        Ishan Chattopadhyaya
      15. SOLR-8220.patch
        23 kB
        Ishan Chattopadhyaya
      16. SOLR-8220.patch
        23 kB
        Ishan Chattopadhyaya
      17. SOLR-8220.patch
        23 kB
        Keith Laban
      18. SOLR-8220.patch
        23 kB
        Keith Laban
      19. SOLR-8220.patch
        23 kB
        Keith Laban
      20. SOLR-8220.patch
        18 kB
        Ishan Chattopadhyaya
      21. SOLR-8220.patch
        18 kB
        Ishan Chattopadhyaya
      22. SOLR-8220.patch
        18 kB
        Ishan Chattopadhyaya
      23. SOLR-8220.patch
        18 kB
        Keith Laban
      24. SOLR-8220.patch
        20 kB
        Ishan Chattopadhyaya
      25. SOLR-8220.patch
        21 kB
        Keith Laban
      26. SOLR-8220.patch
        21 kB
        Keith Laban
      27. SOLR-8220.patch
        8 kB
        Ishan Chattopadhyaya
      28. SOLR-8220.patch
        9 kB
        Keith Laban
      29. SOLR-8220-5x.patch
        70 kB
        Ishan Chattopadhyaya
      30. SOLR-8220-branch_5x.patch
        110 kB
        Shalin Shekhar Mangar
      31. SOLR-8220-ishan.patch
        18 kB
        Ishan Chattopadhyaya
      32. SOLR-8220-ishan.patch
        18 kB
        Ishan Chattopadhyaya
      33. SOLR-8220-ishan.patch
        17 kB
        Ishan Chattopadhyaya
      34. SOLR-8220-ishan.patch
        16 kB
        Ishan Chattopadhyaya

        Issue Links

          Activity

          Hide
          Erick Erickson added a comment -

          The only surprising behavior I see with this approach is that indexing 2.0 would return different values when returned from docValues and when returned from stored, the DV return might be something like 1.9999999999999999 or even 2.0000000000000 would be a "surprise".

          I'm not against the idea, just want this out there.

          And one side benefit that's not entirely obvious. In sharded situations, the first pass returns the candidate list ID and "sort criteria". The way it's written last I knew was it returned stored values, which required decompression because it gets the stored field. If all the sort fields were DV, then we wouldn't have to do this.

          This can't be the complete story since you can index but not store a sort field and distributed works, but it's one path I believe I've seen. It's an open question how to wire that in to standard search for a field that's stored, and a DV field.

          Show
          Erick Erickson added a comment - The only surprising behavior I see with this approach is that indexing 2.0 would return different values when returned from docValues and when returned from stored, the DV return might be something like 1.9999999999999999 or even 2.0000000000000 would be a "surprise". I'm not against the idea, just want this out there. And one side benefit that's not entirely obvious. In sharded situations, the first pass returns the candidate list ID and "sort criteria". The way it's written last I knew was it returned stored values, which required decompression because it gets the stored field. If all the sort fields were DV, then we wouldn't have to do this. This can't be the complete story since you can index but not store a sort field and distributed works, but it's one path I believe I've seen. It's an open question how to wire that in to standard search for a field that's stored, and a DV field.
          Hide
          Yonik Seeley added a comment -

          reading values from docValues when a stored version of the field does not exist would be a valuable disk usage optimization.

          +1, and I've heard a number of users request this.

          1) There doesn't seem to be a standard way to read values for docValues, facets, analytics, streaming, etc, all seem to be doing their own ways, perhaps some of this logic should be centralized.

          See ReturnFields / ResultContext, that's currently where stored field handling is centralized, and handles anywhere a field list (or pseudo-fields / transformers) is specified.

          +1 for 2a as a first pass.
          For bonus points, prevent stored fields from being loaded at all when not needed. This gets us a big step closer to having the normal request handler have the same performance as "/export".

          Looking beyond the first pass, it might be nice to use docValues as more of a first-class alternate "stored" mechanism, and consider them part of "*". If for some reason it's desirable to treat some docValues fields as stored, and others not, we could introduce a flag on <field> in the schema.

          The only caveat with this that I can see would be for multiValued fields as they would always be returned sorted in the docValues approach. I believe this is a fair compromise.

          This shouldn't be much of a concern for approach 2a, but another future option would be to add explicit set types, and also implement list-type multi-valued docValues fields... prob using binary docValues under the covers).

          Show
          Yonik Seeley added a comment - reading values from docValues when a stored version of the field does not exist would be a valuable disk usage optimization. +1, and I've heard a number of users request this. 1) There doesn't seem to be a standard way to read values for docValues, facets, analytics, streaming, etc, all seem to be doing their own ways, perhaps some of this logic should be centralized. See ReturnFields / ResultContext, that's currently where stored field handling is centralized, and handles anywhere a field list (or pseudo-fields / transformers) is specified. +1 for 2a as a first pass. For bonus points, prevent stored fields from being loaded at all when not needed. This gets us a big step closer to having the normal request handler have the same performance as "/export". Looking beyond the first pass, it might be nice to use docValues as more of a first-class alternate "stored" mechanism, and consider them part of "*". If for some reason it's desirable to treat some docValues fields as stored, and others not, we could introduce a flag on <field> in the schema. The only caveat with this that I can see would be for multiValued fields as they would always be returned sorted in the docValues approach. I believe this is a fair compromise. This shouldn't be much of a concern for approach 2a, but another future option would be to add explicit set types, and also implement list-type multi-valued docValues fields... prob using binary docValues under the covers).
          Hide
          Yonik Seeley added a comment -

          The way it's written last I knew was it returned stored values, which required decompression because it gets the stored field.

          Right, we should change this (and this issue may handle that out-of-the-box, if we chose not to store the "id" field).

          If all the sort fields were DV, then we wouldn't have to do this.

          The sort field values returned in the first phase of distributed search aren't obtained from stored field values.

          Show
          Yonik Seeley added a comment - The way it's written last I knew was it returned stored values, which required decompression because it gets the stored field. Right, we should change this (and this issue may handle that out-of-the-box, if we chose not to store the "id" field). If all the sort fields were DV, then we wouldn't have to do this. The sort field values returned in the first phase of distributed search aren't obtained from stored field values.
          Hide
          Erick Erickson added a comment -

          I'm not talking about sorting here. The sorting during scoring certainly uses DV values.

          Here's the sequence (two shards, no followers, rows=10, illustrating with just the action on shard2)
          1> shard1 gets the incoming request
          2> shard1 sends a sub-request for candidate top 10 to shard2
          3> shard2 returns its candidate top 10 to shard1. This return packet has the top 10 doc IDs and sort criteria.
          4> shard1 combines the lists of candidate top 10 docs and picks the true top 10
          5> shard1 asks shard2 for the docs that came from shard2 and made it into the sorted top 10 list.

          What I'm talking about is step <3>. Last I knew, this did not use the DV fields, but pulled the stored value thus decompressing.

          I'm not sure how this is resolved for fields that aren't stored, there must be a fallback. Or I'm missing something here, wouldn't be the first time. Maybe Yonik's idea of making DV fields more "first class citizens" would just take care of the issue entirely.

          Show
          Erick Erickson added a comment - I'm not talking about sorting here. The sorting during scoring certainly uses DV values. Here's the sequence (two shards, no followers, rows=10, illustrating with just the action on shard2) 1> shard1 gets the incoming request 2> shard1 sends a sub-request for candidate top 10 to shard2 3> shard2 returns its candidate top 10 to shard1. This return packet has the top 10 doc IDs and sort criteria. 4> shard1 combines the lists of candidate top 10 docs and picks the true top 10 5> shard1 asks shard2 for the docs that came from shard2 and made it into the sorted top 10 list. What I'm talking about is step <3>. Last I knew, this did not use the DV fields, but pulled the stored value thus decompressing. I'm not sure how this is resolved for fields that aren't stored, there must be a fallback. Or I'm missing something here, wouldn't be the first time. Maybe Yonik's idea of making DV fields more "first class citizens" would just take care of the issue entirely.
          Hide
          Erick Erickson added a comment - - edited

          bq: The sort field values returned in the first phase of distributed search aren't obtained from stored field values.

          Thanks, I suspect I was inappropriately generalizing from the <uniqueKey>..... Which answers how sort values get returned from non-stored fields....

          I suspect that practically it doesn't matter since to get a single field you have to decompress a 16K block, so getting the ID to be fetched from the DV field should be a win...

          Show
          Erick Erickson added a comment - - edited bq: The sort field values returned in the first phase of distributed search aren't obtained from stored field values. Thanks, I suspect I was inappropriately generalizing from the <uniqueKey>..... Which answers how sort values get returned from non-stored fields.... I suspect that practically it doesn't matter since to get a single field you have to decompress a 16K block, so getting the ID to be fetched from the DV field should be a win...
          Hide
          Yonik Seeley added a comment -

          3> shard2 returns its candidate top 10 to shard1. This return packet has the top 10 doc IDs and sort criteria.

          We currently pull the sort criteria from the field comparators implementing the sort:
          https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L638

          Show
          Yonik Seeley added a comment - 3> shard2 returns its candidate top 10 to shard1. This return packet has the top 10 doc IDs and sort criteria. We currently pull the sort criteria from the field comparators implementing the sort: https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L638
          Hide
          David Smiley added a comment -

          FYI I have a patch in SOLR-5478, and I've done it without patching Solr as well using a different technique.

          Show
          David Smiley added a comment - FYI I have a patch in SOLR-5478 , and I've done it without patching Solr as well using a different technique.
          Hide
          Ishan Chattopadhyaya added a comment -

          +1 to doing this. I think this will be useful for SOLR-5944, and was anyway planning to split this functionality out into its own issue. Also, if we go forward with _version_ field as docvalues field, then this becomes important. In current Solr, the way to read non-stored docValues fields is to use a function query, field(mydvfield).

          Keith Laban Are you planning to work on this / have a patch for this? If not, then I can give it a try and have SOLR-5944 depend on it.

          Show
          Ishan Chattopadhyaya added a comment - +1 to doing this. I think this will be useful for SOLR-5944 , and was anyway planning to split this functionality out into its own issue. Also, if we go forward with _version_ field as docvalues field, then this becomes important. In current Solr, the way to read non-stored docValues fields is to use a function query, field(mydvfield). Keith Laban Are you planning to work on this / have a patch for this? If not, then I can give it a try and have SOLR-5944 depend on it.
          Hide
          Yonik Seeley added a comment - - edited

          +1 to doing this. I think this will be useful for SOLR-5944, and was anyway planning to split this functionality out into its own issue.

          Ah, right, atomic updates needs this functionality as well if we are to allow docValues fields that aren't stored.
          In that case I'll amend my previous comments around ResultContext... that's appropriate for decorating documents as they are being returned, but perhaps not low enough level for other use cases.

          edit: I'm basically agreeing with Keith's original observation - "I think it should live closer to where stored fields are loaded in the SolrIndexSearcher."

          Show
          Yonik Seeley added a comment - - edited +1 to doing this. I think this will be useful for SOLR-5944 , and was anyway planning to split this functionality out into its own issue. Ah, right, atomic updates needs this functionality as well if we are to allow docValues fields that aren't stored. In that case I'll amend my previous comments around ResultContext... that's appropriate for decorating documents as they are being returned, but perhaps not low enough level for other use cases. edit: I'm basically agreeing with Keith's original observation - "I think it should live closer to where stored fields are loaded in the SolrIndexSearcher."
          Hide
          Keith Laban added a comment - - edited

          There are two approaches I can see:

          1) implement a new type of StoredFieldReader which is aware of field type (i.e. does it have docValues, is it stored). This reader would delegate between reading from docValues or the stored fields

          2) in the SolrIndexSearcher.doc function do two passes. first pass to get stored fields, a second pass to get docValues. This can go a step further to make the SetNonLazyFieldSelector aware of docValues fields and instruct the the reader to not load fields which are known to be in docValues.

          thoughts?

          edit: Would approach number 2 effect how fields are loaded lazyily (LazyDocument)?

          Show
          Keith Laban added a comment - - edited There are two approaches I can see: 1) implement a new type of StoredFieldReader which is aware of field type (i.e. does it have docValues, is it stored). This reader would delegate between reading from docValues or the stored fields 2) in the SolrIndexSearcher.doc function do two passes. first pass to get stored fields, a second pass to get docValues. This can go a step further to make the SetNonLazyFieldSelector aware of docValues fields and instruct the the reader to not load fields which are known to be in docValues. thoughts? edit: Would approach number 2 effect how fields are loaded lazyily (LazyDocument)?
          Hide
          Keith Laban added a comment -

          adding an initial attempt at this patch. What does everyone think about taking an approach like this?

          This patch will decorate a document with docValue values after the stored values have been read. For now it skips multivalued fields and only reads from docValues if FL is specified and the field is not stored.

          Some problems I noticed with this approach are:

          1) LazyDocument doesn't support a notion of loading from docValues, without
          mucking around in there I can't see a way apply the docValues before caching
          because of various FLs.

          2) There is no metadata (that I can find) stored for each document that says
          whether it has an unstored docValue field, so efficiently loading docValues
          fields based on FL=* would be difficult. The only possibly way right now is to
          iterate over all schema fields looking for the viable docValue fields for each
          matched document.

          3) More of a question: What kind of FieldType should be created when adding
          these docValues to the document?

          I have an alternate patch that attempts to preload stored values from docValues
          before handing over to the IndexReader. Those fields are then skipped later
          on. It passes tests, but it's not a very elegent approach. And also has
          its limitations; it only works when FL is specified and it wouldn't
          work on the cached hit of the document if a field is loaded with LazyDocument.

          Show
          Keith Laban added a comment - adding an initial attempt at this patch. What does everyone think about taking an approach like this? This patch will decorate a document with docValue values after the stored values have been read. For now it skips multivalued fields and only reads from docValues if FL is specified and the field is not stored. Some problems I noticed with this approach are: 1) LazyDocument doesn't support a notion of loading from docValues, without mucking around in there I can't see a way apply the docValues before caching because of various FLs. 2) There is no metadata (that I can find) stored for each document that says whether it has an unstored docValue field, so efficiently loading docValues fields based on FL=* would be difficult. The only possibly way right now is to iterate over all schema fields looking for the viable docValue fields for each matched document. 3) More of a question: What kind of FieldType should be created when adding these docValues to the document? I have an alternate patch that attempts to preload stored values from docValues before handing over to the IndexReader. Those fields are then skipped later on. It passes tests, but it's not a very elegent approach. And also has its limitations; it only works when FL is specified and it wouldn't work on the cached hit of the document if a field is loaded with LazyDocument.
          Hide
          Ishan Chattopadhyaya added a comment -

          Thanks for the patch, Keith.
          I couldn't apply the patch. Is this based on trunk? I saw that SolrIndexSearcher.doc() method returns StoredDocument on trunk, but it seems your patch assumes a Document.
          If this isn't based on trunk, can you please re-work the patch and update it to trunk? Also, an SVN patch would be easier for most developers to work with.

          Show
          Ishan Chattopadhyaya added a comment - Thanks for the patch, Keith. I couldn't apply the patch. Is this based on trunk? I saw that SolrIndexSearcher.doc() method returns StoredDocument on trunk, but it seems your patch assumes a Document . If this isn't based on trunk, can you please re-work the patch and update it to trunk? Also, an SVN patch would be easier for most developers to work with.
          Hide
          Ishan Chattopadhyaya added a comment -

          Updated your patch to trunk (svn) with no changes to your logic. I'll review it soon.

          Show
          Ishan Chattopadhyaya added a comment - Updated your patch to trunk (svn) with no changes to your logic. I'll review it soon.
          Hide
          Yonik Seeley added a comment -

          2) There is no metadata (that I can find) stored for each document that says
          whether it has an unstored docValue field, so efficiently loading docValues
          fields based on FL=* would be difficult.

          In SolrIndexSearcher there is

            private final FieldInfos fieldInfos;
            private final Collection<String> fieldNames;
          

          That gives you the full set of fields actually in-use by the index. Useful for dealing with dynamicFields, where the names aren't explicitly listed in the schema.

          In the future, perhaps we should have some meta-data about what fields each document contains.

          Show
          Yonik Seeley added a comment - 2) There is no metadata (that I can find) stored for each document that says whether it has an unstored docValue field, so efficiently loading docValues fields based on FL=* would be difficult. In SolrIndexSearcher there is private final FieldInfos fieldInfos; private final Collection< String > fieldNames; That gives you the full set of fields actually in-use by the index. Useful for dealing with dynamicFields, where the names aren't explicitly listed in the schema. In the future, perhaps we should have some meta-data about what fields each document contains.
          Hide
          Keith Laban added a comment -

          Thanks Ishan Chattopadhyaya, this is my first time submitting a patch. This was originally done against 5.3.1 in the future i'll submit it based on trunk.

          I also noticed that there is a bug in the patch I submited.

          The line which reads

          if(null == sf){
            return doc;
          }
          

          should be

          if(null == sf){
            continue;
          }
          

          Yonik Seeley I think that we definitely need something like that. Currently CompressingStoredFieldsReader.visitDocument needs to visit the whole document in stored field to know which fields there are. Ideally we would 1) be able to avoid doing any work in stored fields if they can instead by read out of docValues. 2) have a mechanism for LazyDocument to know to read the lazy field from docValues instead of from stored fields. 3) know about values which aren't stored but should be read from docValues.

          Show
          Keith Laban added a comment - Thanks Ishan Chattopadhyaya , this is my first time submitting a patch. This was originally done against 5.3.1 in the future i'll submit it based on trunk. I also noticed that there is a bug in the patch I submited. The line which reads if ( null == sf){ return doc; } should be if ( null == sf){ continue ; } Yonik Seeley I think that we definitely need something like that. Currently CompressingStoredFieldsReader.visitDocument needs to visit the whole document in stored field to know which fields there are. Ideally we would 1) be able to avoid doing any work in stored fields if they can instead by read out of docValues. 2) have a mechanism for LazyDocument to know to read the lazy field from docValues instead of from stored fields. 3) know about values which aren't stored but should be read from docValues.
          Hide
          Yonik Seeley added a comment -

          1) be able to avoid doing any work in stored fields if they can instead by read out of docValues.

          We should be able to do this optimization regardless? For the fields requested, we can check the schema to see if they have docValues.

          Show
          Yonik Seeley added a comment - 1) be able to avoid doing any work in stored fields if they can instead by read out of docValues. We should be able to do this optimization regardless? For the fields requested, we can check the schema to see if they have docValues.
          Hide
          Keith Laban added a comment -

          I'm talking more specifically about the "FL=*" scenario

          Show
          Keith Laban added a comment - I'm talking more specifically about the "FL=*" scenario
          Hide
          Keith Laban added a comment -

          And for the specified FL scenario, LazyDocument assumes all the possible fields have been accounted for in the document and the not yet loaded fields have a bit offset for quick access in the future. Right now there's no way to cache a partially loaded document unless the whole stored document is visited to pre populate the lazy field with the required information.

          Show
          Keith Laban added a comment - And for the specified FL scenario, LazyDocument assumes all the possible fields have been accounted for in the document and the not yet loaded fields have a bit offset for quick access in the future. Right now there's no way to cache a partially loaded document unless the whole stored document is visited to pre populate the lazy field with the required information.
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          Here's an alternate patch (SOLR-8220-ishan.patch) for this functionality. Uses the same hooks that you've created, i.e. decorateDocValueFields(), and uses your unit tests.

          • Changed the handling of single valued dv fields to have less code,
          • Adds support for multivalued dv fields support,
          • In your patch, fl=*,mydvfield wasn't working. Fixed this.
          • Added a ~ glob, similar to *. fl= here means: return all conventional stored fields and all non stored docvalues.

          Some cleanup / refactoring and a few tests might be needed.

          1) be able to avoid doing any work in stored fields if they can instead by read out of docValues.

          Good point, this optimization would be useful. Not done in this patch, though.

          Show
          Ishan Chattopadhyaya added a comment - - edited Here's an alternate patch ( SOLR-8220 -ishan.patch) for this functionality. Uses the same hooks that you've created, i.e. decorateDocValueFields(), and uses your unit tests. Changed the handling of single valued dv fields to have less code, Adds support for multivalued dv fields support, In your patch, fl=*,mydvfield wasn't working. Fixed this. Added a ~ glob, similar to * . fl= here means: return all conventional stored fields and all non stored docvalues. Some cleanup / refactoring and a few tests might be needed. 1) be able to avoid doing any work in stored fields if they can instead by read out of docValues. Good point, this optimization would be useful. Not done in this patch, though.
          Hide
          Ishan Chattopadhyaya added a comment -

          Updated last patch to use FieldInfos, since searcher.getSchema().getFields() doesn't have the dynamic fields. Modified the test to randomly test the fl parameter with ~ or the dv field name itself.

          Show
          Ishan Chattopadhyaya added a comment - Updated last patch to use FieldInfos, since searcher.getSchema().getFields() doesn't have the dynamic fields. Modified the test to randomly test the fl parameter with ~ or the dv field name itself.
          Hide
          Jan Høydahl added a comment -

          How about using fl=** as glob instead of ~? In Lucene, ~ normally indicates fuzziness of some kind, while ** in ant lingo means all items recursively... Not important though

          Show
          Jan Høydahl added a comment - How about using fl=** as glob instead of ~ ? In Lucene, ~ normally indicates fuzziness of some kind, while ** in ant lingo means all items recursively... Not important though
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          ** seems like a better idea than ~. I couldn't get + to work, as Keith suggested, I guess it was getting confused with a positive sign.

          Show
          Ishan Chattopadhyaya added a comment - - edited ** seems like a better idea than ~. I couldn't get + to work, as Keith suggested, I guess it was getting confused with a positive sign.
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          My patch causes a regression with score fields, I'll look into it in a while.

          Show
          Ishan Chattopadhyaya added a comment - - edited My patch causes a regression with score fields, I'll look into it in a while.
          Hide
          Keith Laban added a comment -

          I see some potential issues here, mostly performance concerns:

          1)

            if (wantsAllNonStoredDocValues) {
              Set<String> dvFields = new HashSet<>();
              for (int i=0; i<fieldInfos.size(); i++) {
                if (fieldInfos.fieldInfo(i) != null && fieldInfos.fieldInfo(i).getDocValuesType() != DocValuesType.NONE) {
                  dvFields.add(fieldInfos.fieldInfo(i).name);
                }
              }
              fields = dvFields;
            }
          

          This could potentially be very expensive to compute for every singe field for ever single document and also add unnecessary GC pressure by creating new HashSet for all the fields for every single document.

          2)

          doc.getField(fieldName)==null
          

          the doc fields are a list so this will be O( n ) for each lookup.

          3) Re multivalued fields: doing introspection for every single value for field for every document is not fast.

          4)

          SchemaField schemaField = schema.getField(fieldName);
          

          this throws an exception if the field name is not in the schema (think typos in FL)

          5)

          FunctionValues values = schemaField.getType().getValueSource(schemaField, null).getValues(null, getLeafReader().getContext());
          

          This creates a whole bunch of new objects which could be slow and cause a lot of GC pressure, although it may not be an issue.

          Show
          Keith Laban added a comment - I see some potential issues here, mostly performance concerns: 1) if (wantsAllNonStoredDocValues) { Set< String > dvFields = new HashSet<>(); for ( int i=0; i<fieldInfos.size(); i++) { if (fieldInfos.fieldInfo(i) != null && fieldInfos.fieldInfo(i).getDocValuesType() != DocValuesType.NONE) { dvFields.add(fieldInfos.fieldInfo(i).name); } } fields = dvFields; } This could potentially be very expensive to compute for every singe field for ever single document and also add unnecessary GC pressure by creating new HashSet for all the fields for every single document. 2) doc.getField(fieldName)== null the doc fields are a list so this will be O( n ) for each lookup. 3) Re multivalued fields: doing introspection for every single value for field for every document is not fast. 4) SchemaField schemaField = schema.getField(fieldName); this throws an exception if the field name is not in the schema (think typos in FL) 5) FunctionValues values = schemaField.getType().getValueSource(schemaField, null ).getValues( null , getLeafReader().getContext()); This creates a whole bunch of new objects which could be slow and cause a lot of GC pressure, although it may not be an issue.
          Hide
          Ishan Chattopadhyaya added a comment -

          Thanks for the review, Keith.

          1. [...] This could potentially be very expensive to compute for every singe field for ever single document and also add unnecessary GC pressure by creating new HashSet for all the fields for every single document.

          I was aware of this, and wanted to fix this as part of the "cleanup / refactoring" I promised.

          doc.getField(fieldName)==null the doc fields are a list so this will be O( n ) for each lookup.

          I used that to ensure we're not re-adding unstored docvalues a second time to the same document. This is necessary here so that we don't re-add such fields to a document was obtained from the documentCache and already has all unstored docvalues in it. I can create a set of fields inside the StoredDocument class so that a hasField lookup can be speeded up. However, given that it is a Lucene class, I have left this be. Any suggestions?

          3) Re multivalued fields: doing introspection for every single value for field for every document is not fast.

          I think it shouldn't be a problem. In modern JVMs, the instanceof has negligible cost. However, I will do it once per multivalued field in my next patch.

          4) SchemaField schemaField = schema.getField(fieldName); this throws an exception if the field name is not in the schema (think typos in FL)

          If it is a dynamic field, it will still work; a wrong field name won't work here. Shouldn't a wrong field name throw an exception, rather than silently dropping it? I am split either ways.

          This creates a whole bunch of new objects which could be slow and cause a lot of GC pressure, although it may not be an issue.

          I think this creates at most only the value source object, which isn't too bad. Internally, it uses the docvalues API.

          Show
          Ishan Chattopadhyaya added a comment - Thanks for the review, Keith. 1. [...] This could potentially be very expensive to compute for every singe field for ever single document and also add unnecessary GC pressure by creating new HashSet for all the fields for every single document. I was aware of this, and wanted to fix this as part of the "cleanup / refactoring" I promised. doc.getField(fieldName)==null the doc fields are a list so this will be O( n ) for each lookup. I used that to ensure we're not re-adding unstored docvalues a second time to the same document. This is necessary here so that we don't re-add such fields to a document was obtained from the documentCache and already has all unstored docvalues in it. I can create a set of fields inside the StoredDocument class so that a hasField lookup can be speeded up. However, given that it is a Lucene class, I have left this be. Any suggestions? 3) Re multivalued fields: doing introspection for every single value for field for every document is not fast. I think it shouldn't be a problem. In modern JVMs, the instanceof has negligible cost. However, I will do it once per multivalued field in my next patch. 4) SchemaField schemaField = schema.getField(fieldName); this throws an exception if the field name is not in the schema (think typos in FL) If it is a dynamic field, it will still work; a wrong field name won't work here. Shouldn't a wrong field name throw an exception, rather than silently dropping it? I am split either ways. This creates a whole bunch of new objects which could be slow and cause a lot of GC pressure, although it may not be an issue. I think this creates at most only the value source object, which isn't too bad. Internally, it uses the docvalues API.
          Hide
          Keith Laban added a comment -

          I used that to ensure we're not re-adding unstored docvalues a second time to the same document. This is necessary here so that we don't re-add such fields to a document was obtained from the documentCache and already has all unstored docvalues in it. I can create a set of fields inside the StoredDocument class so that a hasField lookup can be speeded up. However, given that it is a Lucene class, I have left this be. Any suggestions?

          This shouldn't be an issue since the hook is called after caching is done. This could get really expensive if you are getting a few thousand documents that have hundreds of fields. I think the real issue is how do we cache this efficiently. I think that will require modifying LazyDocument, (see my comments above)

          If it is a dynamic field, it will still work; a wrong field name won't work here. Shouldn't a wrong field name throw an exception, rather than silently dropping it? I am split either ways.

          This is more a backwards compat thing. What is current behavior for stored fields?

          I think this creates at most only the value source object, which isn't too bad. Internally, it uses the docvalues API.

          for a string field, getValueSource creates a new StrFieldSource and getValues creates a new DocTermsIndexDocValues. Both of these closures add overhead especially if you're doing this hundreds of times for thousands of documents

          Show
          Keith Laban added a comment - I used that to ensure we're not re-adding unstored docvalues a second time to the same document. This is necessary here so that we don't re-add such fields to a document was obtained from the documentCache and already has all unstored docvalues in it. I can create a set of fields inside the StoredDocument class so that a hasField lookup can be speeded up. However, given that it is a Lucene class, I have left this be. Any suggestions? This shouldn't be an issue since the hook is called after caching is done. This could get really expensive if you are getting a few thousand documents that have hundreds of fields. I think the real issue is how do we cache this efficiently. I think that will require modifying LazyDocument, (see my comments above) If it is a dynamic field, it will still work; a wrong field name won't work here. Shouldn't a wrong field name throw an exception, rather than silently dropping it? I am split either ways. This is more a backwards compat thing. What is current behavior for stored fields? I think this creates at most only the value source object, which isn't too bad. Internally, it uses the docvalues API. for a string field, getValueSource creates a new StrFieldSource and getValues creates a new DocTermsIndexDocValues. Both of these closures add overhead especially if you're doing this hundreds of times for thousands of documents
          Hide
          Ishan Chattopadhyaya added a comment -

          This shouldn't be an issue since the hook is called after caching is done.

          Even if this is called after the document has been added to the cache, this decorate() method changes the same doc object that has been added to the cache. And hence, next time the document is fetched from the cache, it will contain the previously decorated docvalues as part of the stored doc from the cache.

          I'll look at what it will take to modify the LazyDocument to make this work differently. Are you already looking into it, or have some thoughts around it?

          Both of these closures add overhead especially if you're doing this hundreds of times for thousands of documents

          Yes, that makes sense; I hadn't noticed the second object getting created. We should avoid this overhead if possible.

          Show
          Ishan Chattopadhyaya added a comment - This shouldn't be an issue since the hook is called after caching is done. Even if this is called after the document has been added to the cache, this decorate() method changes the same doc object that has been added to the cache. And hence, next time the document is fetched from the cache, it will contain the previously decorated docvalues as part of the stored doc from the cache. I'll look at what it will take to modify the LazyDocument to make this work differently. Are you already looking into it, or have some thoughts around it? Both of these closures add overhead especially if you're doing this hundreds of times for thousands of documents Yes, that makes sense; I hadn't noticed the second object getting created. We should avoid this overhead if possible.
          Hide
          Yonik Seeley added a comment -

          Added a ~ glob, similar to *. fl= here means: return all conventional stored fields and all non stored docvalues.

          Purely from an interface perspective (I haven't looked at the code), it feels like this should be transparent.
          It would be nice to be able to transition from an indexed+stored field to an indexed+docValues field and not have any of the clients know/care.

          fl=myfield  // returns from either stored or docValues
          fl=*_i         // returns all stored or docValues fields ending in _i
          fl=*            // returns all stored fields and all docValues fields that are not stored
          

          If there is a need to distinguish between docValues as an alternative to a stored field, and docValues as an implementation detail that you don't want to return to the user (say you transitioned from an indexed-only field to an indexed+docValues field or docValues-only field), then we could introduce a field flag for the schema.
          Something like includeInStored=true/false or asStored=true/false

          Show
          Yonik Seeley added a comment - Added a ~ glob, similar to *. fl= here means: return all conventional stored fields and all non stored docvalues. Purely from an interface perspective (I haven't looked at the code), it feels like this should be transparent. It would be nice to be able to transition from an indexed+stored field to an indexed+docValues field and not have any of the clients know/care. fl=myfield // returns from either stored or docValues fl=*_i // returns all stored or docValues fields ending in _i fl=* // returns all stored fields and all docValues fields that are not stored If there is a need to distinguish between docValues as an alternative to a stored field, and docValues as an implementation detail that you don't want to return to the user (say you transitioned from an indexed-only field to an indexed+docValues field or docValues-only field), then we could introduce a field flag for the schema. Something like includeInStored=true/false or asStored=true/false
          Hide
          Keith Laban added a comment -

          If there is a need to distinguish between docValues as an alternative to a stored field

          I think this would be the case only for multi valued fields at least until we had an alternative version of docValue multi valued preserving the original field (i.e. not sorted, not set) using something like BinaryDocValues underneath as you mentioned earlier.

          I'll look at what it will take to modify the LazyDocument to make this work differently. Are you already looking into it, or have some thoughts around it?

          Doing this properly requires us to be able to know all the possibly docValue fields on a document upfront and a way for LazyDocument to be able to load the lazy field from doc values.

          A large goal of this should be to have the ability to skip reading stored fields altogether if the field requirement is fully satisfied by docValues. However I'm not sure if using docValues would be more efficient than stored fields when all the fields are being returned.

          Show
          Keith Laban added a comment - If there is a need to distinguish between docValues as an alternative to a stored field I think this would be the case only for multi valued fields at least until we had an alternative version of docValue multi valued preserving the original field (i.e. not sorted, not set) using something like BinaryDocValues underneath as you mentioned earlier. I'll look at what it will take to modify the LazyDocument to make this work differently. Are you already looking into it, or have some thoughts around it? Doing this properly requires us to be able to know all the possibly docValue fields on a document upfront and a way for LazyDocument to be able to load the lazy field from doc values. A large goal of this should be to have the ability to skip reading stored fields altogether if the field requirement is fully satisfied by docValues. However I'm not sure if using docValues would be more efficient than stored fields when all the fields are being returned.
          Hide
          Yonik Seeley added a comment -

          I think this would be the case only for multi valued fields at least until we had an alternative version of docValue multi valued preserving the original field (i.e. not sorted, not set) using something like BinaryDocValues underneath as you mentioned earlier.

          Yup, I agree. I think this is just a case of us having incomplete type support. We need to distinguish between multiValued and setValued in general.

          Show
          Yonik Seeley added a comment - I think this would be the case only for multi valued fields at least until we had an alternative version of docValue multi valued preserving the original field (i.e. not sorted, not set) using something like BinaryDocValues underneath as you mentioned earlier. Yup, I agree. I think this is just a case of us having incomplete type support. We need to distinguish between multiValued and setValued in general.
          Hide
          Ishan Chattopadhyaya added a comment -

          Updated patch. This has:

          • Now uses ** for returning all non-stored docvalues fields.
          • Using docvalues API directly for single valued fields (instead of valuesource api)
          • Few cleanups here and there.
          Show
          Ishan Chattopadhyaya added a comment - Updated patch. This has: Now uses ** for returning all non-stored docvalues fields. Using docvalues API directly for single valued fields (instead of valuesource api) Few cleanups here and there.
          Hide
          Ishan Chattopadhyaya added a comment -

          Purely from an interface perspective (I haven't looked at the code), it feels like this should be transparent.

          That makes sense, having * return all stored and non-stored docvalues.

          Show
          Ishan Chattopadhyaya added a comment - Purely from an interface perspective (I haven't looked at the code), it feels like this should be transparent. That makes sense, having * return all stored and non-stored docvalues.
          Hide
          Yonik Seeley added a comment -

          The set of fields that have docValues and are not stored can be computed once per index snapshot (from the FieldInfos+schema).
          There should be no performance impact if there are no un-stored docValues fields in use.

          Show
          Yonik Seeley added a comment - The set of fields that have docValues and are not stored can be computed once per index snapshot (from the FieldInfos+schema). There should be no performance impact if there are no un-stored docValues fields in use.
          Hide
          Ishan Chattopadhyaya added a comment -

          The set of fields that have docValues and are not stored can be computed once per index snapshot (from the FieldInfos+schema).

          That is what I have done at the time of searcher creation (in SolrIndexSearcher's constructor) in my last patch. Does that sound fine?

          Show
          Ishan Chattopadhyaya added a comment - The set of fields that have docValues and are not stored can be computed once per index snapshot (from the FieldInfos+schema). That is what I have done at the time of searcher creation (in SolrIndexSearcher's constructor) in my last patch. Does that sound fine?
          Hide
          Yonik Seeley added a comment -

          Does that sound fine?

          Yep. Seems like we will only have a perf issue when we have many sparse un-stored docValue fields. At that point it might make sense to have a separate docValues field that contains the list of fields for the document. That can be saved for a future optimization though.

          Show
          Yonik Seeley added a comment - Does that sound fine? Yep. Seems like we will only have a perf issue when we have many sparse un-stored docValue fields. At that point it might make sense to have a separate docValues field that contains the list of fields for the document. That can be saved for a future optimization though.
          Hide
          Erick Erickson added a comment -

          Given that stored values are compressed in 16k blocks and to return a single field from the stored data requires decompressing 16K, how much effect do LazyDocuments really have any more? I don't know, just askin'

          Since docValues avoids decompressing 16k per doc, disk seeks and the like I strongly suspect that it is vastly more efficient than getting the stored values. That's how Streaming Aggregation can return on 200k-400k docs/second.

          All that said, I suspect that there are negligible savings (or perhaps even costs) in mixing the two, i.e. if any field to be returned is not DV, you might as well return all the fields from the stored data.

          Testing would tell though.

          Show
          Erick Erickson added a comment - Given that stored values are compressed in 16k blocks and to return a single field from the stored data requires decompressing 16K, how much effect do LazyDocuments really have any more? I don't know, just askin' Since docValues avoids decompressing 16k per doc, disk seeks and the like I strongly suspect that it is vastly more efficient than getting the stored values. That's how Streaming Aggregation can return on 200k-400k docs/second. All that said, I suspect that there are negligible savings (or perhaps even costs) in mixing the two, i.e. if any field to be returned is not DV, you might as well return all the fields from the stored data. Testing would tell though.
          Hide
          Keith Laban added a comment -

          Added a patch based on Ishan Chattopadhyaya latest patch.
          needs to be perf tested.

          Theoretical optimization, will skip reading from stored fields if all the requested fields are available in docValues. (changes mostly to DocStreamer)
          Caveats being:

          • Cannot optimize if any fields are multi valued.
          • Cannot optimize for * queries.
          • Does not cache the document (slower in the long run?)
            • How can we cache? using doc.getField, perhaps? or LazyDocument?
          Show
          Keith Laban added a comment - Added a patch based on Ishan Chattopadhyaya latest patch. needs to be perf tested. Theoretical optimization, will skip reading from stored fields if all the requested fields are available in docValues. (changes mostly to DocStreamer) Caveats being: Cannot optimize if any fields are multi valued. Cannot optimize for * queries. Does not cache the document (slower in the long run?) How can we cache? using doc.getField, perhaps? or LazyDocument?
          Hide
          Keith Laban added a comment -

          reformatted patch to be svn style and cleaned up code from last update

          Show
          Keith Laban added a comment - reformatted patch to be svn style and cleaned up code from last update
          Hide
          Ishan Chattopadhyaya added a comment -

          Caveats being:

          I think we can live with those caveats for now, and optimize later. A docValues fields containing list of other docValues fields sounds nice for a later optimization. Given that this is a functional improvement, as is, over what we have today (i.e. no ability to return nonstored docValues), we should carry on with it and optimize later to address those caveats.

          Theoretical optimization, will skip reading from stored fields if all the requested fields are available in docValues. (changes mostly to DocStreamer)

          Sounds good. It would be interesting to perf test this to measure the performance gains with doing this.

          Show
          Ishan Chattopadhyaya added a comment - Caveats being: I think we can live with those caveats for now, and optimize later. A docValues fields containing list of other docValues fields sounds nice for a later optimization. Given that this is a functional improvement, as is, over what we have today (i.e. no ability to return nonstored docValues), we should carry on with it and optimize later to address those caveats. Theoretical optimization, will skip reading from stored fields if all the requested fields are available in docValues. (changes mostly to DocStreamer) Sounds good. It would be interesting to perf test this to measure the performance gains with doing this.
          Hide
          Yonik Seeley added a comment -

          Back in the day, LazyField actually had a pointer directly into the index where the field value could be read.
          That got remove from Lucene at some point, and was replaced with something just for compat sake IIRC that had an N^2 bug... doc was loaded on each lazy-field access, which Hoss found/fixed. But that leaves less performance benefit to using LazyDocument. On a quick look, it seems to load all lazy fields at once when the first lazy field is touched. I guess these days it's more of a memory optimization than a performance one.

          Might be worth considering new approaches (we can break back compat in trunk for 6.0).
          Or maybe subclass LazyDocument and do something different for docValues fields.

          Show
          Yonik Seeley added a comment - Back in the day, LazyField actually had a pointer directly into the index where the field value could be read. That got remove from Lucene at some point, and was replaced with something just for compat sake IIRC that had an N^2 bug... doc was loaded on each lazy-field access, which Hoss found/fixed. But that leaves less performance benefit to using LazyDocument. On a quick look, it seems to load all lazy fields at once when the first lazy field is touched. I guess these days it's more of a memory optimization than a performance one. Might be worth considering new approaches (we can break back compat in trunk for 6.0). Or maybe subclass LazyDocument and do something different for docValues fields.
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          Updated the patch with a minor one line fix in the multi-valued string field case:

                       doc.add(schemaField.getType().createField(schemaField, values.lookupOrd(i).utf8ToString(), 1f));
          

          Fyi, since I had already copied over the multivalued fields support from SOLR-8276 patch to this patch earlier, I've made SOLR-8276 depend on this issue. So, if we fix this issue now, we'll be able to fix (a) search results with non stored docvalues, (b) RTG of documents containing non stored docvalues, (c) atomic updates of documents containing non stored docvalues (for updates to both regular fields as well as non stored docvalues). I will make SOLR-5944 depend on this now, and update the patch there.

          Show
          Ishan Chattopadhyaya added a comment - - edited Updated the patch with a minor one line fix in the multi-valued string field case: doc.add(schemaField.getType().createField(schemaField, values.lookupOrd(i).utf8ToString(), 1f)); Fyi, since I had already copied over the multivalued fields support from SOLR-8276 patch to this patch earlier, I've made SOLR-8276 depend on this issue. So, if we fix this issue now, we'll be able to fix (a) search results with non stored docvalues, (b) RTG of documents containing non stored docvalues, (c) atomic updates of documents containing non stored docvalues (for updates to both regular fields as well as non stored docvalues). I will make SOLR-5944 depend on this now, and update the patch there.
          Hide
          Keith Laban added a comment -

          I'm not sure how this should be handled.

          https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/document/Field.java#L241

          needs to be modified to

              if (!type.stored() && type.indexOptions() == IndexOptions.NONE && type.docValuesType() != DocValuesType.NONE) {
                throw new IllegalArgumentException("it doesn't make sense to have a field that "
                  + "is neither indexed nor stored nor docValues");
              }
          
          Show
          Keith Laban added a comment - I'm not sure how this should be handled. https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/document/Field.java#L241 needs to be modified to if (!type.stored() && type.indexOptions() == IndexOptions.NONE && type.docValuesType() != DocValuesType.NONE) { throw new IllegalArgumentException( "it doesn't make sense to have a field that " + "is neither indexed nor stored nor docValues" ); }
          Hide
          Ishan Chattopadhyaya added a comment -

          I think you meant a type.docValuesType() == DocValuesType.NONE. I agree, we should make the change.

          Show
          Ishan Chattopadhyaya added a comment - I think you meant a type.docValuesType() == DocValuesType.NONE . I agree, we should make the change.
          Hide
          Ishan Chattopadhyaya added a comment -
          fl=myfield  // returns from either stored or docValues
          fl=*_i         // returns all stored or docValues fields ending in _i
          fl=*            // returns all stored fields and all docValues fields that are not stored
          

          This is a TODO. Also, need to add more tests around multivalued fields (here and in SOLR-8276).
          As for LazyFields for non stored docvalues, I think it is an optimization that we can deal with later in a separate issue. Doing what we have in this patch itself is progress. If some committer can please take this forward, can we get this in for 5.4?
          Keith, do you wish to tackle the above TODO (and other todo items)? If so, I will focus on SOLR-5944 for now.

          Show
          Ishan Chattopadhyaya added a comment - fl=myfield // returns from either stored or docValues fl=*_i // returns all stored or docValues fields ending in _i fl=* // returns all stored fields and all docValues fields that are not stored This is a TODO. Also, need to add more tests around multivalued fields (here and in SOLR-8276 ). As for LazyFields for non stored docvalues, I think it is an optimization that we can deal with later in a separate issue. Doing what we have in this patch itself is progress. If some committer can please take this forward, can we get this in for 5.4? Keith, do you wish to tackle the above TODO (and other todo items)? If so, I will focus on SOLR-5944 for now.
          Hide
          Keith Laban added a comment -

          Yonik Seeley: re. * vs **

          I'm all for using just one glob pattern * for both doc values and stored, however I think its worth it to consider the philosophical implications on backwards compat. While it won't break anything it does introduce some unexpected behavior without much warning or a way to disable it.

          I propose we add a fl.wildcardDV=true option to turn on this behavior in Solr 5x but enable it by default in 6. We can optionally later add a field type option where you can use docValues but not have the field returned in your result set.

          Ishan I can tackle those.

          Regarding my earlier update about modifying Field, its doesn't seem as trivial as I originally thought as createField doesn't set DocValuesType there is a separate field that gets created for doc values after createField is called for the first time. I'm not sure what the implications of modifying this behavior would be. I think it's ok to leave this limitation in for now.

          Show
          Keith Laban added a comment - Yonik Seeley : re. * vs ** I'm all for using just one glob pattern * for both doc values and stored, however I think its worth it to consider the philosophical implications on backwards compat. While it won't break anything it does introduce some unexpected behavior without much warning or a way to disable it. I propose we add a fl.wildcardDV=true option to turn on this behavior in Solr 5x but enable it by default in 6. We can optionally later add a field type option where you can use docValues but not have the field returned in your result set. Ishan I can tackle those. Regarding my earlier update about modifying Field , its doesn't seem as trivial as I originally thought as createField doesn't set DocValuesType there is a separate field that gets created for doc values after createField is called for the first time. I'm not sure what the implications of modifying this behavior would be. I think it's ok to leave this limitation in for now.
          Hide
          Keith Laban added a comment -

          Created SOLR-8316 for my last point

          Show
          Keith Laban added a comment - Created SOLR-8316 for my last point
          Hide
          Yonik Seeley added a comment - - edited

          I think you meant a type.docValuesType() == DocValuesType.NONE. I agree, we should make the change.

          I don't think some of the Lucene folks want docValues modeled as stored fields at the Lucene level.

          One possible option is just move to something higher lievel like SolrDocument.

          I propose we add a fl.wildcardDV=true option to turn on this behavior in Solr 5x but enable it by default in 6.

          We could bump the schema number to change the default, and that would enable the folks on 5x to get transparent migration from stored fields to docValues if they want w/o having to change clients / query params.

          And if finer grained control is desirable, a schema field flag actAsStored=true (or whatever better name people come up with) could have it's default set differently based on the schema version.

          Show
          Yonik Seeley added a comment - - edited I think you meant a type.docValuesType() == DocValuesType.NONE. I agree, we should make the change. I don't think some of the Lucene folks want docValues modeled as stored fields at the Lucene level. One possible option is just move to something higher lievel like SolrDocument. I propose we add a fl.wildcardDV=true option to turn on this behavior in Solr 5x but enable it by default in 6. We could bump the schema number to change the default, and that would enable the folks on 5x to get transparent migration from stored fields to docValues if they want w/o having to change clients / query params. And if finer grained control is desirable, a schema field flag actAsStored=true (or whatever better name people come up with) could have it's default set differently based on the schema version.
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          We are adding fields retrieved from docValues by doing the following:

          doc.add(schemaField.getType().createField(schemaField, sdv.get(docid).utf8ToString(), 1.0f));
          

          this createField call is returning null based on the code I wrote above. Perhaps we need to create fields differently, or change how createField works.

          [Referencing this comment from SOLR-8316]. Can we "decorate" the SolrDocument in DocStreamer instead of trying to do that with the StoredDocument from lucene? That will give us the benefits: (a) we won't need to fix SOLR-8316 (although I still don't understand how it affects the work here, since I thought the createField was doing its job and consequently the tests were passing. Maybe I'm missing something), (b) we can leave the StoredDocument as is, and not change it from under the document cache (which is probably an awkward thing with the current patch), (c) it has efficient containsKey(), if needed, so the linear O(n) cost can be avoided. Though, point b will mean we won't need containsKey() anyway.
          This also means that SOLR-8276 will have to change, and there we will have to decorate a SolrInputDocument instead of a SolrDocument.
          Keith, Yonik, what do you think?

          Show
          Ishan Chattopadhyaya added a comment - - edited We are adding fields retrieved from docValues by doing the following: doc.add(schemaField.getType().createField(schemaField, sdv.get(docid).utf8ToString(), 1.0f)); this createField call is returning null based on the code I wrote above. Perhaps we need to create fields differently, or change how createField works. [Referencing this comment from SOLR-8316] . Can we "decorate" the SolrDocument in DocStreamer instead of trying to do that with the StoredDocument from lucene? That will give us the benefits: (a) we won't need to fix SOLR-8316 (although I still don't understand how it affects the work here, since I thought the createField was doing its job and consequently the tests were passing. Maybe I'm missing something), (b) we can leave the StoredDocument as is, and not change it from under the document cache (which is probably an awkward thing with the current patch), (c) it has efficient containsKey(), if needed, so the linear O(n) cost can be avoided. Though, point b will mean we won't need containsKey() anyway. This also means that SOLR-8276 will have to change, and there we will have to decorate a SolrInputDocument instead of a SolrDocument. Keith, Yonik, what do you think?
          Hide
          Keith Laban added a comment -

          It looks like calling createFields instead actually creates both the fields we need.

          I was able to change decoration to something like this

          SortedDocValues sdv = leafReader.getSortedDocValues(fieldName);
          for(StorableField s: schemaField.getType().createFields(schemaField, sdv.get(docid).utf8ToString(), 1.0f)) {
            if(s != null) doc.add(s);
          }
          

          which makes the the SortedDocValueField get added to the document properly, but when trying to write the string value later on it doesn't write anything because the implementation in Field doesn't know how to write a BytesRef. We can override this is in SortedDocValueField but all of that stuff is in lucene code. It looks like StrField#createFields converts the string value to a BytesRef for the constructors of the doc values fields.

          I'm still a bit confused how SOLR-8276 works for you, i get a NPE when trying pull back the non-indexed/non-stored field in the current impl.

          Show
          Keith Laban added a comment - It looks like calling createFields instead actually creates both the fields we need. I was able to change decoration to something like this SortedDocValues sdv = leafReader.getSortedDocValues(fieldName); for (StorableField s: schemaField.getType().createFields(schemaField, sdv.get(docid).utf8ToString(), 1.0f)) { if (s != null ) doc.add(s); } which makes the the SortedDocValueField get added to the document properly, but when trying to write the string value later on it doesn't write anything because the implementation in Field doesn't know how to write a BytesRef . We can override this is in SortedDocValueField but all of that stuff is in lucene code. It looks like StrField#createFields converts the string value to a BytesRef for the constructors of the doc values fields. I'm still a bit confused how SOLR-8276 works for you, i get a NPE when trying pull back the non-indexed/non-stored field in the current impl.
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          I'm still a bit confused how SOLR-8276 works for you, i get a NPE when trying pull back the non-indexed/non-stored field in the current impl.

          I added another document (id=4) to the test at SOLR-8276. I see no problems whatsoever with string dv fields (single valued), which internally uses SortedDocValues. The test passes fine. Also, the test at BasicFunctionalityTest works fine with the test_s_dvo field. Both SOLR-8276 and the latter test use the latest patch here. So, as per the tests, the createField seems to do its job. Am I missing something?

          However, beyond this point, should we avoid using the schemaField.getType().createField() for fields in the StoredDocument (lucene) and instead do this decoration on the SolrDocument which is created from this StoredDocument? (See my comment before this one). I can try to put together a strawman patch for this approach, if you suggest, to see if it works.

          Show
          Ishan Chattopadhyaya added a comment - - edited I'm still a bit confused how SOLR-8276 works for you, i get a NPE when trying pull back the non-indexed/non-stored field in the current impl. I added another document (id=4) to the test at SOLR-8276 . I see no problems whatsoever with string dv fields (single valued), which internally uses SortedDocValues. The test passes fine. Also, the test at BasicFunctionalityTest works fine with the test_s_dvo field. Both SOLR-8276 and the latter test use the latest patch here. So, as per the tests, the createField seems to do its job. Am I missing something? However, beyond this point, should we avoid using the schemaField.getType().createField() for fields in the StoredDocument (lucene) and instead do this decoration on the SolrDocument which is created from this StoredDocument? (See my comment before this one). I can try to put together a strawman patch for this approach, if you suggest, to see if it works.
          Hide
          Keith Laban added a comment -

          I changed decorateDocValueFields to operate based on a SolrDocument instead of a lucene StoredDocument and no longer relies on createField (and it now works on non-stored/non-indexed fields). Also includes a test for the non-stored/non-indexed field type.

          In this patch I also removed code related to ** in favor of * for docValue related fields. Still TODO is to make this behavior only apply to a new schema version.

          Show
          Keith Laban added a comment - I changed decorateDocValueFields to operate based on a SolrDocument instead of a lucene StoredDocument and no longer relies on createField (and it now works on non-stored/non-indexed fields). Also includes a test for the non-stored/non-indexed field type. In this patch I also removed code related to ** in favor of * for docValue related fields. Still TODO is to make this behavior only apply to a new schema version.
          Hide
          Ishan Chattopadhyaya added a comment -

          Updating the patch with the following changes:

          1. Since this decorate method would be used from the RealTimeGetComponent as well, and there it will have to decorate a SolrInputDocument, I've changed the method to handle both a SolrDocument and SolrInputDocument, depending on what is sent in.
          2. The BasicFunctionalityTest was failing for me since it depended on a field "test_s_dv", which didn't exist in the schema. I've added that to the schema.xml.
          3. Added a javadoc for the decorate method.

          Keith, please review. If you think these changes make sense, then I'll base SOLR-8276 on this one. As you mentioned, the schema version check is still a TODO.

          Show
          Ishan Chattopadhyaya added a comment - Updating the patch with the following changes: Since this decorate method would be used from the RealTimeGetComponent as well, and there it will have to decorate a SolrInputDocument, I've changed the method to handle both a SolrDocument and SolrInputDocument, depending on what is sent in. The BasicFunctionalityTest was failing for me since it depended on a field "test_s_dv", which didn't exist in the schema. I've added that to the schema.xml. Added a javadoc for the decorate method. Keith, please review. If you think these changes make sense, then I'll base SOLR-8276 on this one. As you mentioned, the schema version check is still a TODO.
          Hide
          Ishan Chattopadhyaya added a comment -

          Just noticed, EnumFieldTest.testEnumSort() fails with the last patch (and the one before it) when the severity_dv is chosen as the enum field. I think we're handling the enum dv fields incorrectly.

          Show
          Ishan Chattopadhyaya added a comment - Just noticed, EnumFieldTest.testEnumSort() fails with the last patch (and the one before it) when the severity_dv is chosen as the enum field. I think we're handling the enum dv fields incorrectly.
          Hide
          Keith Laban added a comment -

          I'm working on a separate patch which fixes EnumField, it also adds support for "*_dv" type queries. I'll take a look at merging your change in too. Do you think it would be worth adding an interface for SolrDocument and SolrInputDocument to implement which includes containsKey} and {{addField

          Show
          Keith Laban added a comment - I'm working on a separate patch which fixes EnumField, it also adds support for "*_dv" type queries. I'll take a look at merging your change in too. Do you think it would be worth adding an interface for SolrDocument and SolrInputDocument to implement which includes containsKey} and {{addField
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          I was thinking of doing exactly that! SOLR-8339

          Show
          Ishan Chattopadhyaya added a comment - - edited I was thinking of doing exactly that! SOLR-8339
          Hide
          Shalin Shekhar Mangar added a comment -

          Guys, sorry for not paying attention to this earlier but on a quick reading through the comments and an offline conversation with Ishan, I want to point out a few things.

          Theoretical optimization, will skip reading from stored fields if all the requested fields are available in docValues

          I don't think some of the Lucene folks want docValues modeled as stored fields at the Lucene level.

          From a performance perspective, reading values from DocValues always (if they exist) can be horrible because each field access in docvalues may need a random disk seek, whereas, all stored fields for a document are kept together and need only 1 random seek and a sequential block read. That, and the fact that docvalues aren't in the document cache makes me think that we should not model docvalues as a stored field and treat them equivalently. At least not without supporting benchmarks.

          So my suggestion is that we not mix the two issues i.e. keep this issue focused on adding syntactic sugar to read field from doc values for non-stored fields in whatever ways proposed. By the way, this is already possible using the 'field' DocTransformer e.g. fl=field(my_dv_field)

          Show
          Shalin Shekhar Mangar added a comment - Guys, sorry for not paying attention to this earlier but on a quick reading through the comments and an offline conversation with Ishan, I want to point out a few things. Theoretical optimization, will skip reading from stored fields if all the requested fields are available in docValues I don't think some of the Lucene folks want docValues modeled as stored fields at the Lucene level. From a performance perspective, reading values from DocValues always (if they exist) can be horrible because each field access in docvalues may need a random disk seek, whereas, all stored fields for a document are kept together and need only 1 random seek and a sequential block read. That, and the fact that docvalues aren't in the document cache makes me think that we should not model docvalues as a stored field and treat them equivalently. At least not without supporting benchmarks. So my suggestion is that we not mix the two issues i.e. keep this issue focused on adding syntactic sugar to read field from doc values for non-stored fields in whatever ways proposed. By the way, this is already possible using the 'field' DocTransformer e.g. fl=field(my_dv_field)
          Hide
          Ishan Chattopadhyaya added a comment -

          So my suggestion is that we not mix the two issues

          I've created SOLR-8344 to deal with that optimization of reading stored valued fields from docvalues, so that we can focus only on the non-stored dv case (which is a functionality improvement/new feature, as opposed to an optimization).

          By the way, this is already possible using the 'field' DocTransformer e.g. fl=field(my_dv_field)

          Indeed. This currently doesn't work for multivalued fields.

          On a different note, if we are going to tackle only the non-stored docValues fields for now in this issue, does it now make sense to do this, performance wise, at the DocTransformer instead of the SolrIndexSearcher? If done that way, the RTG and atomic updates for such non-stored dv fields will work if we also use a doctransformer after the searcher has returned the docs. Yonik Seeley, Erick Erickson, Shalin Shekhar Mangar do you have any thoughts/recommendation, please?

          Show
          Ishan Chattopadhyaya added a comment - So my suggestion is that we not mix the two issues I've created SOLR-8344 to deal with that optimization of reading stored valued fields from docvalues, so that we can focus only on the non-stored dv case (which is a functionality improvement/new feature, as opposed to an optimization). By the way, this is already possible using the 'field' DocTransformer e.g. fl=field(my_dv_field) Indeed. This currently doesn't work for multivalued fields. On a different note, if we are going to tackle only the non-stored docValues fields for now in this issue, does it now make sense to do this, performance wise , at the DocTransformer instead of the SolrIndexSearcher? If done that way, the RTG and atomic updates for such non-stored dv fields will work if we also use a doctransformer after the searcher has returned the docs. Yonik Seeley , Erick Erickson , Shalin Shekhar Mangar do you have any thoughts/recommendation, please?
          Hide
          Keith Laban added a comment -

          From a performance perspective, reading values from DocValues always (if they exist) can be horrible because each field access in docvalues may need a random disk seek, whereas, all stored fields for a document are kept together and need only 1 random seek and a sequential block read.

          I agree that reading values from DocValues always can be horrible, my line of thinking was that if you are asking for one or two fields from a large document and they are both dv and stored reading from dv would likely be much more efficient. It might make more sense to be able to get those values explicitly from docValues using the the transformer, or have some logic that can determine when it is more efficient. That should be discussed further in SOLR-8344.

          At this point the question that remains; should we move forward with these patches and move logic for retrieving dv fields to SolrIndexSearcher, leaving out *, *_foo and other optimizations for now? i.e. retrieve fields by name, if they exist in dv, but are not stored.

          Show
          Keith Laban added a comment - From a performance perspective, reading values from DocValues always (if they exist) can be horrible because each field access in docvalues may need a random disk seek, whereas, all stored fields for a document are kept together and need only 1 random seek and a sequential block read. I agree that reading values from DocValues always can be horrible, my line of thinking was that if you are asking for one or two fields from a large document and they are both dv and stored reading from dv would likely be much more efficient. It might make more sense to be able to get those values explicitly from docValues using the the transformer, or have some logic that can determine when it is more efficient. That should be discussed further in SOLR-8344 . At this point the question that remains; should we move forward with these patches and move logic for retrieving dv fields to SolrIndexSearcher, leaving out * , *_foo and other optimizations for now? i.e. retrieve fields by name, if they exist in dv, but are not stored.
          Hide
          Yonik Seeley added a comment -

          From a performance perspective, reading values from DocValues always (if they exist) can be horrible because each field access in docvalues may need a random disk seek, whereas, all stored fields for a document are kept together and need only 1 random seek and a sequential block read.

          A few points:

          • stored fields also require decompression (more overhead)
          • use of stored fields and docvalues at the same time is less memory efficient - the stored fields will also take up needed disk cache (although hopefully the OS will figure out which it should cache more aggressively
          • presumably one has docvalues because they need to be used, and they need to be fast... i.e. they already need to be cached.
          • if one as a small set of fields that are normally retrieved, it seems like a win again.
          • a very common case these days is that the entire index fits in memory.
          • we're in the SSD era, and multiple "seeks" will still be more expensive if not cached, but much less so (and less so over time as non-volatile storage keeps improving)

          It seems like this should be a big win for the common case, and the ability to reindex your data or change config and not have to change the clients is important IMO. It's like being able to reindex a date to a trie-date and have the clients not care. We can already reindex a field as docValues, and sort, facet, do analytics, without changing client requests. Optimizations to field value retrieval (or optionally removing redundantly stored data) should be the same.

          Show
          Yonik Seeley added a comment - From a performance perspective, reading values from DocValues always (if they exist) can be horrible because each field access in docvalues may need a random disk seek, whereas, all stored fields for a document are kept together and need only 1 random seek and a sequential block read. A few points: stored fields also require decompression (more overhead) use of stored fields and docvalues at the same time is less memory efficient - the stored fields will also take up needed disk cache (although hopefully the OS will figure out which it should cache more aggressively presumably one has docvalues because they need to be used, and they need to be fast... i.e. they already need to be cached. if one as a small set of fields that are normally retrieved, it seems like a win again. a very common case these days is that the entire index fits in memory. we're in the SSD era, and multiple "seeks" will still be more expensive if not cached, but much less so (and less so over time as non-volatile storage keeps improving) It seems like this should be a big win for the common case, and the ability to reindex your data or change config and not have to change the clients is important IMO. It's like being able to reindex a date to a trie-date and have the clients not care. We can already reindex a field as docValues, and sort, facet, do analytics, without changing client requests. Optimizations to field value retrieval (or optionally removing redundantly stored data) should be the same.
          Hide
          Shalin Shekhar Mangar added a comment -

          It seems like this should be a big win for the common case, and the ability to reindex your data or change config and not have to change the clients is important IMO.

          It sounds like you are arguing for a common way to access docvalues and stored fields using the 'fl' parameter. I'm +1 to that.

          But are you also arguing for always loading fields from docvalues even if they are stored?

          Show
          Shalin Shekhar Mangar added a comment - It seems like this should be a big win for the common case, and the ability to reindex your data or change config and not have to change the clients is important IMO. It sounds like you are arguing for a common way to access docvalues and stored fields using the 'fl' parameter. I'm +1 to that. But are you also arguing for always loading fields from docvalues even if they are stored?
          Hide
          Shalin Shekhar Mangar added a comment -

          On a different note, if we are going to tackle only the non-stored docValues fields for now in this issue, does it now make sense to do this, performance wise, at the DocTransformer instead of the SolrIndexSearcher?

          I don't think there's any difference performance-wise. Changes to DocStreamer should be enough as it is called only for writing the response and not the entire result-set.

          At this point the question that remains; should we move forward with these patches and move logic for retrieving dv fields to SolrIndexSearcher, leaving out *, *_foo and other optimizations for now? i.e. retrieve fields by name, if they exist in dv, but are not stored.

          +1 let's create a patch to retrieve fields by name, if they exist in dv, but are not stored. I also like Yonik's idea of bumping the schema version to have fl=* return all fields (stored + non-stored docvalues) in 5.x and to include both by default in trunk (6.x). So +1 to that as well.

          a very common case these days is that the entire index fits in memory.

          I propose a middle ground. Let's use Lucene's spinning disk utility method and prefer docvalues if we detect a SSD and fallback to reading from stored fields otherwise. Let's discuss this optimization in SOLR-8344 and keep the two issues separate.

          Show
          Shalin Shekhar Mangar added a comment - On a different note, if we are going to tackle only the non-stored docValues fields for now in this issue, does it now make sense to do this, performance wise, at the DocTransformer instead of the SolrIndexSearcher? I don't think there's any difference performance-wise. Changes to DocStreamer should be enough as it is called only for writing the response and not the entire result-set. At this point the question that remains; should we move forward with these patches and move logic for retrieving dv fields to SolrIndexSearcher, leaving out *, *_foo and other optimizations for now? i.e. retrieve fields by name, if they exist in dv, but are not stored. +1 let's create a patch to retrieve fields by name, if they exist in dv, but are not stored. I also like Yonik's idea of bumping the schema version to have fl=* return all fields (stored + non-stored docvalues) in 5.x and to include both by default in trunk (6.x). So +1 to that as well. a very common case these days is that the entire index fits in memory. I propose a middle ground. Let's use Lucene's spinning disk utility method and prefer docvalues if we detect a SSD and fallback to reading from stored fields otherwise. Let's discuss this optimization in SOLR-8344 and keep the two issues separate.
          Hide
          Yonik Seeley added a comment -

          It sounds like you are arguing for a common way to access docvalues and stored fields using the 'fl' parameter. I'm +1 to that.

          Ah, ok... I mis-read your previous comment of "i.e. keep this issue focused on adding syntactic sugar to read field from doc values for non-stored fields" as advocating for new syntax for "fl" to load from non-stored docValue fields.

          Let's discuss this optimization in SOLR-8344 and keep the two issues separate.

          I didn't really see it as separate (it depends on how you look at it), I see it more as, we have a new feature that treats docValues as "column-stored". What should the default behavior be when all requested fields are both column-stored and row-stored? I think we can make progress + commit this issue separately, but should still come at SOLR-8344 "fresh" (i.e. not put the burden of proof on one default more than the other).

          Show
          Yonik Seeley added a comment - It sounds like you are arguing for a common way to access docvalues and stored fields using the 'fl' parameter. I'm +1 to that. Ah, ok... I mis-read your previous comment of "i.e. keep this issue focused on adding syntactic sugar to read field from doc values for non-stored fields" as advocating for new syntax for "fl" to load from non-stored docValue fields. Let's discuss this optimization in SOLR-8344 and keep the two issues separate. I didn't really see it as separate (it depends on how you look at it), I see it more as, we have a new feature that treats docValues as "column-stored". What should the default behavior be when all requested fields are both column-stored and row-stored? I think we can make progress + commit this issue separately, but should still come at SOLR-8344 "fresh" (i.e. not put the burden of proof on one default more than the other).
          Hide
          Shalin Shekhar Mangar added a comment -

          Agreed. Let's wrap this up. Ishan Chattopadhyaya or Keith Laban, can one of you put up an updated patch? I can review and commit.

          Show
          Shalin Shekhar Mangar added a comment - Agreed. Let's wrap this up. Ishan Chattopadhyaya or Keith Laban , can one of you put up an updated patch? I can review and commit.
          Hide
          Ishan Chattopadhyaya added a comment -

          Updating the patch.

          1. Only tackles non-stored docvalues.
          2. Depends on the refactoring performed at SOLR-8339.
          Show
          Ishan Chattopadhyaya added a comment - Updating the patch. Only tackles non-stored docvalues. Depends on the refactoring performed at SOLR-8339 .
          Hide
          Keith Laban added a comment -

          I found some issues with the way this patch works and have some updated tests in separate patch which I haven't uploaded, I need to clean it up a bit and can submit it tonight.

          One issues I found is that if a document doesn't have a field value for a dv field it will get an empty value in the response, we should probably check with in the doc value api for docs with field, to make sure that that document had a value.

          Show
          Keith Laban added a comment - I found some issues with the way this patch works and have some updated tests in separate patch which I haven't uploaded, I need to clean it up a bit and can submit it tonight. One issues I found is that if a document doesn't have a field value for a dv field it will get an empty value in the response, we should probably check with in the doc value api for docs with field, to make sure that that document had a value.
          Hide
          Ishan Chattopadhyaya added a comment -

          Afaik, there's no way to discern if a singly valued non-stored dv field was added to the document or not, since a singly valued dv field either gets the value provided during the indexing or it picks up the default. Any suggestions how to deal with that?

          Show
          Ishan Chattopadhyaya added a comment - Afaik, there's no way to discern if a singly valued non-stored dv field was added to the document or not, since a singly valued dv field either gets the value provided during the indexing or it picks up the default. Any suggestions how to deal with that?
          Hide
          Erick Erickson added a comment - - edited

          Deleting, see discussion at SOLR-8344

          Show
          Erick Erickson added a comment - - edited Deleting, see discussion at SOLR-8344
          Hide
          Yonik Seeley added a comment -

          For strings, an ord of -1 is "missing"
          For numerics, you can use DocValues.getDocsWithField();

          Show
          Yonik Seeley added a comment - For strings, an ord of -1 is "missing" For numerics, you can use DocValues.getDocsWithField();
          Hide
          Ishan Chattopadhyaya added a comment -

          have some updated tests in separate patch which I haven't uploaded, I need to clean it up a bit and can submit it tonight.

          I've fixed the issue in this updated patch, thanks to Yonik's suggestion. Keith, looking forward to the tests.

          Show
          Ishan Chattopadhyaya added a comment - have some updated tests in separate patch which I haven't uploaded, I need to clean it up a bit and can submit it tonight. I've fixed the issue in this updated patch, thanks to Yonik's suggestion. Keith, looking forward to the tests.
          Hide
          Keith Laban added a comment -
          • Removed leaked in SolrBaseDocument references
          • Fixed support for EnumField (single and multi valued)
          • Fixed support for *_i type globs
          • Removed tests from BasicFuntionalityTests and moved them to a new test class along with its own test schema.
          • Added more robust testing including multivalued field testing
          Show
          Keith Laban added a comment - Removed leaked in SolrBaseDocument references Fixed support for EnumField (single and multi valued) Fixed support for *_i type globs Removed tests from BasicFuntionalityTests and moved them to a new test class along with its own test schema. Added more robust testing including multivalued field testing
          Hide
          Keith Laban added a comment -

          This still needs work around bumping the schema version, this impl changes the default behavior for globs

          Show
          Keith Laban added a comment - This still needs work around bumping the schema version, this impl changes the default behavior for globs
          Hide
          Keith Laban added a comment -

          Crushed one more bug with multivalued fields being added on docs without that value and a test for it

          Show
          Keith Laban added a comment - Crushed one more bug with multivalued fields being added on docs without that value and a test for it
          Hide
          Keith Laban added a comment -

          One more patch.. It looks like at some point the logic for onlyPseudoFields was modified to something that may not work, so reverting that line back to the original.

          Show
          Keith Laban added a comment - One more patch.. It looks like at some point the logic for onlyPseudoFields was modified to something that may not work, so reverting that line back to the original.
          Hide
          Ishan Chattopadhyaya added a comment -

          Looks good, Keith. I was wondering why you chose to name the test file and the schema file as "TestStoredDocValues" and "schema-stored-docvalues.xml", whereas we are dealing with only non-stored docvalues in this issue?

          I've updated the patch to now check for the schema version to be >=1.6. Please review, and feel free to refactor further. We'll have to add a test (using a schema version < 1.6) to ensure backcompat behaviour is not broken.

          Show
          Ishan Chattopadhyaya added a comment - Looks good, Keith. I was wondering why you chose to name the test file and the schema file as "TestStoredDocValues" and "schema-stored-docvalues.xml", whereas we are dealing with only non-stored docvalues in this issue? I've updated the patch to now check for the schema version to be >=1.6. Please review, and feel free to refactor further. We'll have to add a test (using a schema version < 1.6) to ensure backcompat behaviour is not broken.
          Hide
          Ishan Chattopadhyaya added a comment -

          Removed leaked in SolrBaseDocument references

          The intention of using SolrDocumentBase, instead of SolrDocument, was so that SOLR-8276 can use the same decorate method on a SolrInputDocument. I've re-added the SolrDocumentBase reference, which is implemented in SOLR-8339. Please apply SOLR-8339 first, and then apply/work with this patch.

          Show
          Ishan Chattopadhyaya added a comment - Removed leaked in SolrBaseDocument references The intention of using SolrDocumentBase, instead of SolrDocument, was so that SOLR-8276 can use the same decorate method on a SolrInputDocument. I've re-added the SolrDocumentBase reference, which is implemented in SOLR-8339 . Please apply SOLR-8339 first, and then apply/work with this patch.
          Hide
          Ishan Chattopadhyaya added a comment -

          I've updated the patch to now check for the schema version to be >=1.6

          In the patch, I did some clumsy >= check for a float. Would've been better to just check for > 1.5. I'll fix it with the next patch.
          I'm working on bumping up the version of the schema for the out of the box configsets.

          Show
          Ishan Chattopadhyaya added a comment - I've updated the patch to now check for the schema version to be >=1.6 In the patch, I did some clumsy >= check for a float. Would've been better to just check for > 1.5. I'll fix it with the next patch. I'm working on bumping up the version of the schema for the out of the box configsets.
          Hide
          Ishan Chattopadhyaya added a comment -

          I've updated the patch to now check for the schema version to be >=1.6

          As per an offline discussion with Hoss Man, he mentioned that in the past we've never tied runtime behaviour with specific schema versions. He suggested, as did Yonik above I think, that we use some attribute like docValueAsIfStored=true/false (with a default of false for previous schema versions).
          I'll try to tackle this and update the patch, and drop the naive check for schema version.

          Show
          Ishan Chattopadhyaya added a comment - I've updated the patch to now check for the schema version to be >=1.6 As per an offline discussion with Hoss Man , he mentioned that in the past we've never tied runtime behaviour with specific schema versions. He suggested, as did Yonik above I think, that we use some attribute like docValueAsIfStored=true/false (with a default of false for previous schema versions). I'll try to tackle this and update the patch, and drop the naive check for schema version.
          Hide
          Keith Laban added a comment -

          My only concern is that we may have to add a flag for this and for whatever we decide in SOLR-8344 which is just going to add to he confusion. Thoughts?

          Show
          Keith Laban added a comment - My only concern is that we may have to add a flag for this and for whatever we decide in SOLR-8344 which is just going to add to he confusion. Thoughts?
          Hide
          Ishan Chattopadhyaya added a comment -

          Updating the patch to bump the version up to 1.6.

          1. Renamed Keith's TestStoredDocValues to TestNonStoredDocValues. Renamed the corresponding schema file too.
          2. Adds useDocValuesAsStored parameter (true/false) to schema file. It defaults to true >= 1.6 version of schema.
          3. Changing all 1.5 version schema files (test and examples) to 1.6. This will help catch bugs/unexpected behaviour.

          The patch is failing a few tests with very poor error reporting. Here's a reproducible failure after applying this patch:
          Test suite: TestManagedSchemaDynamicFieldResource, Seed: -Dtests.seed=C0DE559FF2A0799
          Looking into the failure.

          Show
          Ishan Chattopadhyaya added a comment - Updating the patch to bump the version up to 1.6. Renamed Keith's TestStoredDocValues to TestNonStoredDocValues. Renamed the corresponding schema file too. Adds useDocValuesAsStored parameter (true/false) to schema file. It defaults to true >= 1.6 version of schema. Changing all 1.5 version schema files (test and examples) to 1.6. This will help catch bugs/unexpected behaviour. The patch is failing a few tests with very poor error reporting. Here's a reproducible failure after applying this patch: Test suite: TestManagedSchemaDynamicFieldResource, Seed: -Dtests.seed=C0DE559FF2A0799 Looking into the failure.
          Hide
          Ishan Chattopadhyaya added a comment -

          The patch is failing a few tests with very poor error reporting. Here's a reproducible failure after applying this patch:
          Test suite: TestManagedSchemaDynamicFieldResource, Seed: -Dtests.seed=C0DE559FF2A0799
          Looking into the failure.

          It seems this test fails even without the patch. Filed SOLR-8411 for this. However, with this patch, still 5-6 tests fail. But I've so far been unable to reproduce any of them.

          Show
          Ishan Chattopadhyaya added a comment - The patch is failing a few tests with very poor error reporting. Here's a reproducible failure after applying this patch: Test suite: TestManagedSchemaDynamicFieldResource, Seed: -Dtests.seed=C0DE559FF2A0799 Looking into the failure. It seems this test fails even without the patch. Filed SOLR-8411 for this. However, with this patch, still 5-6 tests fail. But I've so far been unable to reproduce any of them.
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          It seems the failures on trunk were unrelated to this patch, and possibly related to the Jetty upgrade to 9.3. To verify this, I ported the patch to branch5x (attached the patch) and it passes all tests.

          TODO: Make sure managed schema can handle the newly added useDocValuesAsStored parameter in the schema (test for persistence across edits to managed schema).

          Apart from the TODO item, can Keith Laban and Shalin Shekhar Mangar please review the patch (the trunk one)?

          My only concern is that we may have to add a flag for this and for whatever we decide in SOLR-8344 which is just going to add to he confusion. Thoughts?

          I think the useDocValuesAsStored parameter in the schema is generically named enough to be reused for SOLR-8344, if needed. For more fine-grained per-field control, we can introduce another parameter if needed. We can discuss that in SOLR-8344 itself.

          Show
          Ishan Chattopadhyaya added a comment - - edited It seems the failures on trunk were unrelated to this patch, and possibly related to the Jetty upgrade to 9.3. To verify this, I ported the patch to branch5x (attached the patch) and it passes all tests. TODO: Make sure managed schema can handle the newly added useDocValuesAsStored parameter in the schema (test for persistence across edits to managed schema). Apart from the TODO item, can Keith Laban and Shalin Shekhar Mangar please review the patch (the trunk one)? My only concern is that we may have to add a flag for this and for whatever we decide in SOLR-8344 which is just going to add to he confusion. Thoughts? I think the useDocValuesAsStored parameter in the schema is generically named enough to be reused for SOLR-8344 , if needed. For more fine-grained per-field control, we can introduce another parameter if needed. We can discuss that in SOLR-8344 itself.
          Hide
          Ishan Chattopadhyaya added a comment -

          Updated the patch.

          TODO: Make sure managed schema can handle the newly added useDocValuesAsStored parameter in the schema (test for persistence across edits to managed schema).

          Added support for writing back (persisting) of this flag when used with managed schema.

          Show
          Ishan Chattopadhyaya added a comment - Updated the patch. TODO: Make sure managed schema can handle the newly added useDocValuesAsStored parameter in the schema (test for persistence across edits to managed schema). Added support for writing back (persisting) of this flag when used with managed schema.
          Hide
          Shalin Shekhar Mangar added a comment -

          Hi Ishan, you have misunderstood the "useDocValuesAsStored" parameter. It is supposed to be per-field and not on the entire schema so that you can selectively disable it on fields that you don't want to be treated as if they were stored.

          Show
          Shalin Shekhar Mangar added a comment - Hi Ishan, you have misunderstood the "useDocValuesAsStored" parameter. It is supposed to be per-field and not on the entire schema so that you can selectively disable it on fields that you don't want to be treated as if they were stored.
          Hide
          Ishan Chattopadhyaya added a comment -

          Updated the patch to have the useDocValuesAsStored attribute on a per field basis (instead of being at schema's top level).

          Show
          Ishan Chattopadhyaya added a comment - Updated the patch to have the useDocValuesAsStored attribute on a per field basis (instead of being at schema's top level).
          Hide
          Shalin Shekhar Mangar added a comment - - edited

          Thanks Ishan.

          1. The SolrIndexSearcher.decorateDocValueFields method has a honourUseDVsAsStoredFlag which is always true. We can remove it?
          2. Same for SolrIndexSearcher.getNonStoredDocValuesFieldNames?
          3. The wantsAllFields flag added to SolrIndexSearcher.doc doesn't seem necessary. I guess it was added because the patch adds non stored doc values fields to the 'fnames' but if we can separate out stored fnames from the non-stored doc values to be returned then we can remove this param from both SolrIndexSearcher.doc and SolrIndexSearcher.getNonStoredDocValuesFieldNames
          4. The pattern matching in the DocStreamer constructor makes a bit nervous. Where is the pattern matching done for current stored fields?
          5. The conditional logic in SolrIndexSearcher.decorateDocValueFields for multi-valued fields is too complicated! Can we please simplify this?
          Show
          Shalin Shekhar Mangar added a comment - - edited Thanks Ishan. The SolrIndexSearcher.decorateDocValueFields method has a honourUseDVsAsStoredFlag which is always true. We can remove it? Same for SolrIndexSearcher.getNonStoredDocValuesFieldNames? The wantsAllFields flag added to SolrIndexSearcher.doc doesn't seem necessary. I guess it was added because the patch adds non stored doc values fields to the 'fnames' but if we can separate out stored fnames from the non-stored doc values to be returned then we can remove this param from both SolrIndexSearcher.doc and SolrIndexSearcher.getNonStoredDocValuesFieldNames The pattern matching in the DocStreamer constructor makes a bit nervous. Where is the pattern matching done for current stored fields? The conditional logic in SolrIndexSearcher.decorateDocValueFields for multi-valued fields is too complicated! Can we please simplify this?
          Hide
          Ishan Chattopadhyaya added a comment -

          Thanks for your review, Shalin. I've updated the patch to address your suggestions.

          The SolrIndexSearcher.decorateDocValueFields method has a honourUseDVsAsStoredFlag which is always true. We can remove it?

          Same for SolrIndexSearcher.getNonStoredDocValuesFieldNames?

          Refactored the decorateDocValues() a bit to not send in wantsAllFields flag to the method and to handle it at the DocsStreamer itself. Hence, now, the decorateDocValues() method takes in only the field names it needs to do anything about; the filtering for non-stored dvs is taken care of at DocsStreamer.next() itself.

          Since, for the fl=* case, we need all non-stored DVs that have useDocValuesAsStored=true, but for the general filtering case of fl=dv1,dv2 we need to filter using all non-stored DVs (irrespective of the useDocValuesAsStored flag), I've retained this true/false logic in the getNonStoredDocValuesFieldNames() method. Renamed that method, however, to call it getNonStoredDVs(boolean onlyUseDocValuesAsStored) and added a clear javadoc to this effect.

          The wantsAllFields flag added to SolrIndexSearcher.doc doesn't seem necessary. I guess it was added because the patch adds non stored doc values fields to the 'fnames' but if we can separate out stored fnames from the non-stored doc values to be returned then we can remove this param from both SolrIndexSearcher.doc and SolrIndexSearcher.getNonStoredDocValuesFieldNames

          I think the original motivation was to deal with cases fl=*,nonstoredDv1. Here, the idea initially was that * returns all stored fields, and nonstoredDv1 is added to it. But now, since * takes care of all stored and non-stored dvs, this logic isn't needed. So, this wantsAllFields flag was a left over from a previous patch which I've now removed.

          The pattern matching in the DocStreamer constructor makes a bit nervous. Where is the pattern matching done for current stored fields?

          Keith can weigh in on this better. However, I had a look, and found that responseWriters (e.g. JSONResponseWriter) get the whole SolrDocument at the writeSolrDocument() method, from where it does the following call to drop fields it doesn't need:

              for (String fname : doc.getFieldNames()) {
                if (returnFields!= null && !returnFields.wantsField(fname)) {
                  continue;
                }
          

          This wantsField() call uses wildcard handling.
          So, reviewing this information, it seems like our handling of this at the DocsStreamer is fine here. It doesn't look costly to me, since it is performed only when fl has a pattern, and that pattern is checked against only non-stored DVs. Do you think there's something better that can be done which I'm missing?

          The conditional logic in SolrIndexSearcher.decorateDocValueFields for multi-valued fields is too complicated! Can we please simplify this?

          Made it simpler.

          Show
          Ishan Chattopadhyaya added a comment - Thanks for your review, Shalin. I've updated the patch to address your suggestions. The SolrIndexSearcher.decorateDocValueFields method has a honourUseDVsAsStoredFlag which is always true. We can remove it? Same for SolrIndexSearcher.getNonStoredDocValuesFieldNames? Refactored the decorateDocValues() a bit to not send in wantsAllFields flag to the method and to handle it at the DocsStreamer itself. Hence, now, the decorateDocValues() method takes in only the field names it needs to do anything about; the filtering for non-stored dvs is taken care of at DocsStreamer.next() itself. Since, for the fl=* case, we need all non-stored DVs that have useDocValuesAsStored =true, but for the general filtering case of fl=dv1,dv2 we need to filter using all non-stored DVs (irrespective of the useDocValuesAsStored flag), I've retained this true/false logic in the getNonStoredDocValuesFieldNames() method. Renamed that method, however, to call it getNonStoredDVs(boolean onlyUseDocValuesAsStored) and added a clear javadoc to this effect. The wantsAllFields flag added to SolrIndexSearcher.doc doesn't seem necessary. I guess it was added because the patch adds non stored doc values fields to the 'fnames' but if we can separate out stored fnames from the non-stored doc values to be returned then we can remove this param from both SolrIndexSearcher.doc and SolrIndexSearcher.getNonStoredDocValuesFieldNames I think the original motivation was to deal with cases fl=*,nonstoredDv1 . Here, the idea initially was that * returns all stored fields, and nonstoredDv1 is added to it. But now, since * takes care of all stored and non-stored dvs, this logic isn't needed. So, this wantsAllFields flag was a left over from a previous patch which I've now removed. The pattern matching in the DocStreamer constructor makes a bit nervous. Where is the pattern matching done for current stored fields? Keith can weigh in on this better. However, I had a look, and found that responseWriters (e.g. JSONResponseWriter) get the whole SolrDocument at the writeSolrDocument() method, from where it does the following call to drop fields it doesn't need: for ( String fname : doc.getFieldNames()) { if (returnFields!= null && !returnFields.wantsField(fname)) { continue ; } This wantsField() call uses wildcard handling. So, reviewing this information, it seems like our handling of this at the DocsStreamer is fine here. It doesn't look costly to me, since it is performed only when fl has a pattern, and that pattern is checked against only non-stored DVs. Do you think there's something better that can be done which I'm missing? The conditional logic in SolrIndexSearcher.decorateDocValueFields for multi-valued fields is too complicated! Can we please simplify this? Made it simpler.
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          Btw, just found out that not all query paths actually use a DocsStreamer. I am checking as to what this could be down to.
          EDIT: Sorry, I was seeing ghosts. Was trying this from the admin UI, but I hadn't set the breakpoint properly.

          Show
          Ishan Chattopadhyaya added a comment - - edited Btw, just found out that not all query paths actually use a DocsStreamer. I am checking as to what this could be down to. EDIT: Sorry, I was seeing ghosts. Was trying this from the admin UI, but I hadn't set the breakpoint properly.
          Hide
          Erick Erickson added a comment -

          WARNING: I'm stealing this code and back-porting to 4.x for my own purposes so this may not pertain to 5x. And I'm not very up on the low-level details.

          But this loop for reading multiValued fields puts all the multivalued fields for all the docs on the shard into each doc:

          if (values != null && DocValues.getDocsWithField(atomicReader, fieldName).get(docid)) {
              values.setDocument(docid);
              if (values.getValueCount() > 0) {
                        List<Object> outValues = new LinkedList<Object>();
                        for (int i = 0; i < values.getValueCount(); i++) { // Iterates more than just this doc, I think all of them!
                  
          

          I had more luck with

                    if (values != null) {
                      values.setDocument(docid);
                      List<Object> outValues = new LinkedList<Object>();
                      for (int ord = (int) values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = (int) values.nextOrd()) {
          

          Note that I also think this is unnecessary in the if test, the loop above doesn't do anything bad if there are docs with empty fields:

          DocValues.getDocsWithField(atomicReader, fieldName).get(docid)
          

          I changed it to just if (values != null). But I did have to test outValues.size() > 0 before doing the addField after the loop or I got empty braces in the output doc.

          Again let me emphasize that
          1> I don't know this code well, so take this with a grain of salt
          2> I needed this for a one-off on the 4.x code line and this may work with 5x just fine as-is. Needless to say what I'm doing will never make into the official project....

          But this saved me a TON of work, glad you're tackling this!

          Show
          Erick Erickson added a comment - WARNING: I'm stealing this code and back-porting to 4.x for my own purposes so this may not pertain to 5x. And I'm not very up on the low-level details. But this loop for reading multiValued fields puts all the multivalued fields for all the docs on the shard into each doc: if (values != null && DocValues.getDocsWithField(atomicReader, fieldName).get(docid)) { values.setDocument(docid); if (values.getValueCount() > 0) { List< Object > outValues = new LinkedList< Object >(); for ( int i = 0; i < values.getValueCount(); i++) { // Iterates more than just this doc, I think all of them! I had more luck with if (values != null ) { values.setDocument(docid); List< Object > outValues = new LinkedList< Object >(); for ( int ord = ( int ) values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = ( int ) values.nextOrd()) { Note that I also think this is unnecessary in the if test, the loop above doesn't do anything bad if there are docs with empty fields: DocValues.getDocsWithField(atomicReader, fieldName).get(docid) I changed it to just if (values != null). But I did have to test outValues.size() > 0 before doing the addField after the loop or I got empty braces in the output doc. Again let me emphasize that 1> I don't know this code well, so take this with a grain of salt 2> I needed this for a one-off on the 4.x code line and this may work with 5x just fine as-is. Needless to say what I'm doing will never make into the official project.... But this saved me a TON of work, glad you're tackling this!
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Ishan.

          Since, for the fl=* case, we need all non-stored DVs that have useDocValuesAsStored=true, but for the general filtering case of fl=dv1,dv2 we need to filter using all non-stored DVs (irrespective of the useDocValuesAsStored flag)

          Okay I see what you are saying. The useDocValuesAsStored=true default applies when you request all fields but if you are explicitly asking for a field then we can return it from DVs even if it was marked as useDocValuesAsStored=false. I have mixed feelings about this but I can see where it can be useful e.g. 1st phase of distributed search.

          . However, I had a look, and found that responseWriters (e.g. JSONResponseWriter) get the whole SolrDocument at the writeSolrDocument() method, from where it does the following call to drop fields it doesn't need

          Hmm, yeah, we can't do that with doc values, it'd be too expensive.

          Is there a test which creates a new field with useDocValuesAsStored as true and separately as false using the schema API? I'm assuming you will address Erick's concern above about multi-valued fields.

          Show
          Shalin Shekhar Mangar added a comment - Thanks Ishan. Since, for the fl=* case, we need all non-stored DVs that have useDocValuesAsStored=true, but for the general filtering case of fl=dv1,dv2 we need to filter using all non-stored DVs (irrespective of the useDocValuesAsStored flag) Okay I see what you are saying. The useDocValuesAsStored=true default applies when you request all fields but if you are explicitly asking for a field then we can return it from DVs even if it was marked as useDocValuesAsStored=false. I have mixed feelings about this but I can see where it can be useful e.g. 1st phase of distributed search. . However, I had a look, and found that responseWriters (e.g. JSONResponseWriter) get the whole SolrDocument at the writeSolrDocument() method, from where it does the following call to drop fields it doesn't need Hmm, yeah, we can't do that with doc values, it'd be too expensive. Is there a test which creates a new field with useDocValuesAsStored as true and separately as false using the schema API? I'm assuming you will address Erick's concern above about multi-valued fields.
          Hide
          Ishan Chattopadhyaya added a comment -

          Is there a test which creates a new field with useDocValuesAsStored as true and separately as false using the schema API?

          I had added SchemaVersionSpecificBehaviorTest to test for these various true/false cases. However, there is no useDocValuesAsStored=false case with checking of output. I'll add such a test.

          I'm assuming you will address Erick's concern above about multi-valued fields.

          I'm working through them. So far as I can see, both the current loop with values.getValueCount() and what Erick suggested as a loop are running identically, i.e values.getValueCount() is indeed returning the count of values per document. But I am adding a test to prove it.

          For the DocValues.getDocsWithField(atomicReader, fieldName).get(docid), not having it was resulting in empty fields being returned for documents that weren't supposed to have an docValue (the user never added a docValue for that document during indexing). Again, I think I should add a specific test for that, testing for the number of fields returned (maybe there already is one from Keith, but I'll check again).

          Show
          Ishan Chattopadhyaya added a comment - Is there a test which creates a new field with useDocValuesAsStored as true and separately as false using the schema API? I had added SchemaVersionSpecificBehaviorTest to test for these various true/false cases. However, there is no useDocValuesAsStored=false case with checking of output. I'll add such a test. I'm assuming you will address Erick's concern above about multi-valued fields. I'm working through them. So far as I can see, both the current loop with values.getValueCount() and what Erick suggested as a loop are running identically, i.e values.getValueCount() is indeed returning the count of values per document. But I am adding a test to prove it. For the DocValues.getDocsWithField(atomicReader, fieldName).get(docid) , not having it was resulting in empty fields being returned for documents that weren't supposed to have an docValue (the user never added a docValue for that document during indexing). Again, I think I should add a specific test for that, testing for the number of fields returned (maybe there already is one from Keith, but I'll check again).
          Hide
          Erick Erickson added a comment -

          bq: For the DocValues.getDocsWithField(atomicReader, fieldName).get(docid), not having it was resulting in empty fields being returned for documents that weren't supposed to have an docValue (the user never added a docValue for that document during indexing).

          Right, I had to add a test at the end to avoid that. I didn't track the code thoroughly, but does the DocValues.getDocsWithField allocate a BitSet or just return a pre-existing instance? Or even cache the BitSet somewhere? If it allocates a new BitSet (or even fills up a cache entry), the test at the end might be much less expensive. I didn't track it down though, and if it returns a reference to a cached bitset that will be created anyway, then it's just a style thing....

           if (outValues.size() > 0) {
             sdoc.addField()....
          }
          

          As for whether the loop returns all values in the field, I saw this "by inspection" on the techproducts example (with a few mods for adding docValues="true" to the schema). Again, though, this is 4.x after I hacked a backport and put it in an entirely different place in the code, specifically NOT a visitor pattern. So it's entirely possible that the semantics have changed or hacking it into a different part of the code base has a different context. A test would settle it for all time though.

          Show
          Erick Erickson added a comment - bq: For the DocValues.getDocsWithField(atomicReader, fieldName).get(docid), not having it was resulting in empty fields being returned for documents that weren't supposed to have an docValue (the user never added a docValue for that document during indexing). Right, I had to add a test at the end to avoid that. I didn't track the code thoroughly, but does the DocValues.getDocsWithField allocate a BitSet or just return a pre-existing instance? Or even cache the BitSet somewhere? If it allocates a new BitSet (or even fills up a cache entry), the test at the end might be much less expensive. I didn't track it down though, and if it returns a reference to a cached bitset that will be created anyway , then it's just a style thing.... if (outValues.size() > 0) { sdoc.addField().... } As for whether the loop returns all values in the field, I saw this "by inspection" on the techproducts example (with a few mods for adding docValues="true" to the schema). Again, though, this is 4.x after I hacked a backport and put it in an entirely different place in the code, specifically NOT a visitor pattern. So it's entirely possible that the semantics have changed or hacking it into a different part of the code base has a different context. A test would settle it for all time though.
          Hide
          Ishan Chattopadhyaya added a comment -

          Added tests for the previous two scenarios.
          @Erick, thanks for the catch, the values.getValueCount() wasn't working indeed, and only after adding the test for it could I catch it.

          Shalin, Erick, please review this latest patch. Thanks.

          Show
          Ishan Chattopadhyaya added a comment - Added tests for the previous two scenarios. @Erick, thanks for the catch, the values.getValueCount() wasn't working indeed, and only after adding the test for it could I catch it. Shalin, Erick, please review this latest patch. Thanks.
          Hide
          Ishan Chattopadhyaya added a comment -

          does the DocValues.getDocsWithField allocate a BitSet or just return a pre-existing instance

          While I tried to track it down, I thought it is backed by a pre-existing bitset instance, as created at the leaf reader level. I didn't think it was a performance concern. Could someone confirm this understanding, please? Moreover, since Yonik suggested its use, I was more confident about using it.

          Show
          Ishan Chattopadhyaya added a comment - does the DocValues.getDocsWithField allocate a BitSet or just return a pre-existing instance While I tried to track it down, I thought it is backed by a pre-existing bitset instance, as created at the leaf reader level. I didn't think it was a performance concern. Could someone confirm this understanding, please? Moreover, since Yonik suggested its use, I was more confident about using it.
          Hide
          Erick Erickson added a comment -

          LGTM as far as not returning all the DV values with each doc.

          Show
          Erick Erickson added a comment - LGTM as far as not returning all the DV values with each doc.
          Hide
          Shalin Shekhar Mangar added a comment -

          Changes:

          1. DocStreamer calculates dv fields to return in the constructor for every scenario instead of doing conditional logic and set intersection in next(). Also uses HashSet instead of Treeset (we don't care about order)
          2. Renamed nonStoredDVs in SolrIndexSearcher to allNonStoredDVs
          3. Added simple javadocs describing allNonStoredDVs and nonStoredDVsUsedAsStored
          4. Reformatted new code according to the project code style
          5. Fixed formatting in schema-nonstored-docvalues.xml which failed precommit (tabs instead of spaces)
          6. Added a basic sanity test in TestUseDocValuesAsStored.testNonStoredDocValueFieldsOnEmptyIndex to assert that no exceptions are thrown on unknown fields with various fl combinations.

          Ishan, can you please add a test which uses the schema API to add/modify a field with the useDocValuesAsStored parameter set to true (default), explicitly false and then true again?

          Show
          Shalin Shekhar Mangar added a comment - Changes: DocStreamer calculates dv fields to return in the constructor for every scenario instead of doing conditional logic and set intersection in next(). Also uses HashSet instead of Treeset (we don't care about order) Renamed nonStoredDVs in SolrIndexSearcher to allNonStoredDVs Added simple javadocs describing allNonStoredDVs and nonStoredDVsUsedAsStored Reformatted new code according to the project code style Fixed formatting in schema-nonstored-docvalues.xml which failed precommit (tabs instead of spaces) Added a basic sanity test in TestUseDocValuesAsStored.testNonStoredDocValueFieldsOnEmptyIndex to assert that no exceptions are thrown on unknown fields with various fl combinations. Ishan, can you please add a test which uses the schema API to add/modify a field with the useDocValuesAsStored parameter set to true (default), explicitly false and then true again?
          Hide
          Shalin Shekhar Mangar added a comment -

          With the right patch this time.

          Show
          Shalin Shekhar Mangar added a comment - With the right patch this time.
          Hide
          Shalin Shekhar Mangar added a comment -
          1. Added a null check in DocStreamer so that we do not try to decorate dv fields if none are required
          2. Deletes all docs in testNonStoredDocValueFieldsOnEmptyIndex before running any test
          Show
          Shalin Shekhar Mangar added a comment - Added a null check in DocStreamer so that we do not try to decorate dv fields if none are required Deletes all docs in testNonStoredDocValueFieldsOnEmptyIndex before running any test
          Hide
          Erick Erickson added a comment -

          When we do the docs, we do need to include a warning about ordering. multiValued fields are returned in sorted order when returned from DV fields, not the original order, right?

          Up until now, we've promised that multiValued fields are returned in the same order they were inserted into the document so two MV fields can be treated like parallel arrays. This will not be true of DV fields that are multiValued, correct?

          Suggested text would be something like:

          "Returning stored fields from docValues (default in schema versions 1.6+) returns multiValued fields in sorted order. If you require the older behavior of multiValued fields being returned in the original insertion order, set useDocValuesAsStored="false" for the individual fields or make sure your schema version is < 1.6. This does not require re-indexing."

          Show
          Erick Erickson added a comment - When we do the docs, we do need to include a warning about ordering. multiValued fields are returned in sorted order when returned from DV fields, not the original order, right? Up until now, we've promised that multiValued fields are returned in the same order they were inserted into the document so two MV fields can be treated like parallel arrays. This will not be true of DV fields that are multiValued, correct? Suggested text would be something like: "Returning stored fields from docValues (default in schema versions 1.6+) returns multiValued fields in sorted order. If you require the older behavior of multiValued fields being returned in the original insertion order, set useDocValuesAsStored="false" for the individual fields or make sure your schema version is < 1.6. This does not require re-indexing."
          Hide
          Ishan Chattopadhyaya added a comment -

          Ishan, can you please add a test which uses the schema API to add/modify a field with the useDocValuesAsStored parameter set to true (default), explicitly false and then true again?

          Added two extra tests:

          1. TestUseDocValuesAsStored.testManagedSchema() for testing with managed schema.
          2. TestUseDocValuesAsStored2 suite for testing with Schema API.

          Also, renamed the test methods in TestUseDocValuesAsStored to shorter/simpler names.

          Show
          Ishan Chattopadhyaya added a comment - Ishan, can you please add a test which uses the schema API to add/modify a field with the useDocValuesAsStored parameter set to true (default), explicitly false and then true again? Added two extra tests: TestUseDocValuesAsStored.testManagedSchema() for testing with managed schema. TestUseDocValuesAsStored2 suite for testing with Schema API. Also, renamed the test methods in TestUseDocValuesAsStored to shorter/simpler names.
          Hide
          Ishan Chattopadhyaya added a comment -

          Erick, this patch (8220) returns DVs only for non-stored fields. So, before this change, the users had no way of returning their multivalued non-stored fields (even field() didn't work). However, your suggested text (as is) is spot on once we have this and 8344 in.

          For this issue, I think we should document this caveat to suggest that if insertion order for multivalued fields is important for you, then have your field stored=true (and that this work is not for you).

          Show
          Ishan Chattopadhyaya added a comment - Erick, this patch (8220) returns DVs only for non-stored fields. So, before this change, the users had no way of returning their multivalued non-stored fields (even field() didn't work). However, your suggested text (as is) is spot on once we have this and 8344 in. For this issue, I think we should document this caveat to suggest that if insertion order for multivalued fields is important for you, then have your field stored=true (and that this work is not for you).
          Hide
          Shalin Shekhar Mangar added a comment -

          Improved Ishan's new managed schema test to actually add documents and assert query responses in different scenarios.

          Just to recap, following is the behavior of "fl" with doc value fields with the latest patch:

          1. All doc value fields in schema version 1.6 are useDocValuesAsStored=true by default
          2. fl=`*` will return all stored=true and useDocValuesAsStored=true fields
          3. fl=`,a1` will return all stored=true and useDocValuesAsStored=true fields. If the 'a1' field is useDocValuesAsStored=false then it will *NOT be returned even though it was explicitly asked for.
          4. fl=`a*` will match the specified pattern against only stored=true and useDocValuesAsStored=true fields.
          5. fl=`a1` will return a1 as long as it is stored=true or docValues=true (i.e. useDocValuesAsStored does not matter if a field is explicitly requested)
          Show
          Shalin Shekhar Mangar added a comment - Improved Ishan's new managed schema test to actually add documents and assert query responses in different scenarios. Just to recap, following is the behavior of "fl" with doc value fields with the latest patch: All doc value fields in schema version 1.6 are useDocValuesAsStored=true by default fl=`*` will return all stored=true and useDocValuesAsStored=true fields fl=` ,a1` will return all stored=true and useDocValuesAsStored=true fields. If the 'a1' field is useDocValuesAsStored=false then it will *NOT be returned even though it was explicitly asked for. fl=`a*` will match the specified pattern against only stored=true and useDocValuesAsStored=true fields. fl=`a1` will return a1 as long as it is stored=true or docValues=true (i.e. useDocValuesAsStored does not matter if a field is explicitly requested)
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          Updating the patch:

          1. SolrIndexSearcher's getNonStoredDVs() method now returns unmodifiable sets of field names
          2. Added another method in ReturnFields and SolrReturnFields which returns the additionally requested field names, irrespective of a wantsAll (*) being present in the fl=. This is an overloaded getLuceneFieldNames() with a boolean parameter to "ignoreWantsAll". Is there something better we can name it to?
          3. Now fl=*,a3 case also returns a3, irrespective of whether a3 has useDocValuesAsStored. (This change was made in DocsStreamer's constructor, under:
                // add non-stored DV fields that may have been requested
                if (rctx.getReturnFields().wantsAllFields())  {
            

          Shalin Shekhar Mangar Please feel free to drop changes for points 2, 3, if you think this doesn't make sense from a usecase point of view.

          Show
          Ishan Chattopadhyaya added a comment - - edited Updating the patch: SolrIndexSearcher's getNonStoredDVs() method now returns unmodifiable sets of field names Added another method in ReturnFields and SolrReturnFields which returns the additionally requested field names, irrespective of a wantsAll ( * ) being present in the fl=. This is an overloaded getLuceneFieldNames() with a boolean parameter to "ignoreWantsAll". Is there something better we can name it to? Now fl=*,a3 case also returns a3, irrespective of whether a3 has useDocValuesAsStored. (This change was made in DocsStreamer's constructor, under: // add non-stored DV fields that may have been requested if (rctx.getReturnFields().wantsAllFields()) { Shalin Shekhar Mangar Please feel free to drop changes for points 2, 3, if you think this doesn't make sense from a usecase point of view.
          Hide
          Ishan Chattopadhyaya added a comment -

          Added a code comment to the previous patch. Plus very minor cleanup at DocsStreamer's constructor.

          Show
          Ishan Chattopadhyaya added a comment - Added a code comment to the previous patch. Plus very minor cleanup at DocsStreamer's constructor.
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Ishan.

          Changes:

          1. Create unmodifiable map in SolrIndexSearcher's constructor so that we don't create unmodifiable instances on every call to SolrIndexSearcher.getNonStoredDVs
          2. Fixed javadocs for ReturnFields.getLuceneFieldNames to note that it can return null even if ignoreWantsAll=true
          3. Optimized field selection in DocsStreamer to create fewer objects. We iterate over requested field names and do intersection in the loop.
          Show
          Shalin Shekhar Mangar added a comment - Thanks Ishan. Changes: Create unmodifiable map in SolrIndexSearcher's constructor so that we don't create unmodifiable instances on every call to SolrIndexSearcher.getNonStoredDVs Fixed javadocs for ReturnFields.getLuceneFieldNames to note that it can return null even if ignoreWantsAll=true Optimized field selection in DocsStreamer to create fewer objects. We iterate over requested field names and do intersection in the loop.
          Hide
          Shalin Shekhar Mangar added a comment -
          1. Removed extra license header in schema-non-stored-docvalues.xml which failed the parsing
          2. Fixed the code comment in TestUseDocValuesAsStored2.java that is no longer valid now that we return useDocValuesAsStored=false field if they are explicitly requested in addition to all fields.

          I'll commit this shortly.

          Show
          Shalin Shekhar Mangar added a comment - Removed extra license header in schema-non-stored-docvalues.xml which failed the parsing Fixed the code comment in TestUseDocValuesAsStored2.java that is no longer valid now that we return useDocValuesAsStored=false field if they are explicitly requested in addition to all fields. I'll commit this shortly.
          Hide
          Ishan Chattopadhyaya added a comment -

          +1, LGTM.

          Show
          Ishan Chattopadhyaya added a comment - +1, LGTM.
          Hide
          ASF subversion and git services added a comment -

          Commit 1721795 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1721795 ]

          SOLR-8220: Read field from DocValues for non stored fields

          Show
          ASF subversion and git services added a comment - Commit 1721795 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1721795 ] SOLR-8220 : Read field from DocValues for non stored fields
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          Thanks Shalin Shekhar Mangar for your commit, and Keith, Yonik and Erick for your inputs.

          I suggest we change:

          +  Also note that returning stored fields from docValues (default in schema versions 1.6+) returns multiValued
          +  fields in sorted order. If you require the older behavior of multiValued fields being returned in the
          +  original insertion order, set useDocValuesAsStored="false" for the individual fields or make
          +  sure your schema version is < 1.6. This does not require re-indexing.
          +  See SOLR-8220 for more details.
          

          to

          +  Also note that while returning non-stored fields from docValues (default in schema versions 1.6+, unless useDocValuesAsStored is false), the values of a multi-valued
          +  field are returned in sorted order. If you require the multi-valued fields to be returned in the
          +  original insertion order, then make your multi-valued field as stored. This requires re-indexing.
          +  See SOLR-8220 for more details.
           

          I think the first text block is relevant once SOLR-8344 goes in. In this issue, we're just introducing new behaviour of returning non-stored values from DVs.

          Show
          Ishan Chattopadhyaya added a comment - - edited Thanks Shalin Shekhar Mangar for your commit, and Keith, Yonik and Erick for your inputs. I suggest we change: + Also note that returning stored fields from docValues ( default in schema versions 1.6+) returns multiValued + fields in sorted order. If you require the older behavior of multiValued fields being returned in the + original insertion order, set useDocValuesAsStored= " false " for the individual fields or make + sure your schema version is < 1.6. This does not require re-indexing. + See SOLR-8220 for more details. to + Also note that while returning non-stored fields from docValues ( default in schema versions 1.6+, unless useDocValuesAsStored is false ), the values of a multi-valued + field are returned in sorted order. If you require the multi-valued fields to be returned in the + original insertion order, then make your multi-valued field as stored. This requires re-indexing. + See SOLR-8220 for more details. I think the first text block is relevant once SOLR-8344 goes in. In this issue, we're just introducing new behaviour of returning non-stored values from DVs.
          Hide
          ASF subversion and git services added a comment -

          Commit 1721808 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1721808 ]

          SOLR-8220: Improve upgrade notes in CHANGES.txt

          Show
          ASF subversion and git services added a comment - Commit 1721808 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1721808 ] SOLR-8220 : Improve upgrade notes in CHANGES.txt
          Hide
          Shalin Shekhar Mangar added a comment -

          Patch for branch_5x.

          Show
          Shalin Shekhar Mangar added a comment - Patch for branch_5x.
          Hide
          David Smiley added a comment -

          I've been following this since inception with passing interest and wish I had the time this holiday to comment before Shalin committing. Based on Shalin's summary a couple days ago, fl=* will return these doc-values only fields that have useDocValuesAsStored=true, and it will be true by default going forward. My only concern with this is that it's common-place to index an original input value of text 2 ways – one for keyword search (marked stored as well), and another copyField target that isn't stored, isn't indexed, but has docValues for faceting or sorting. Now, this value will be returned twice from fl=*. Granted the user could set useDocValuesAsStored=false on these fields, and that's not a big deal to do so nor a big deal to forget to do so. Is this not lost on Ishan & Shalin who are putting the work into this issue or is this just a recognized trade-off? It didn't have to be this way. It could have designed such that fl=* is only for stored fields and those explicitly setting useDocValuesAsStored=true.

          Show
          David Smiley added a comment - I've been following this since inception with passing interest and wish I had the time this holiday to comment before Shalin committing. Based on Shalin's summary a couple days ago, fl=* will return these doc-values only fields that have useDocValuesAsStored=true, and it will be true by default going forward. My only concern with this is that it's common-place to index an original input value of text 2 ways – one for keyword search (marked stored as well), and another copyField target that isn't stored, isn't indexed, but has docValues for faceting or sorting. Now, this value will be returned twice from fl=* . Granted the user could set useDocValuesAsStored=false on these fields, and that's not a big deal to do so nor a big deal to forget to do so. Is this not lost on Ishan & Shalin who are putting the work into this issue or is this just a recognized trade-off? It didn't have to be this way. It could have designed such that fl=* is only for stored fields and those explicitly setting useDocValuesAsStored=true.
          Hide
          Ishan Chattopadhyaya added a comment -

          David, do you think having the copyField targets to have useDocValuesAsStored as false in our example schemas partly alleviates the problem?

          In fact, I was planning to open another issue to add a few more dynamic field types that have stored=false, docValues=true in the example schemas. I can add this change to copyFields too, if it makes sense.

          Or, rather, do you prefer useDocValuesAsStored to be false by default and turned on on-demand? I think that will make adoption harder, and make it harder for us (or iow more of an abrupt change for user) to get rid of stored fields (even if it makes sense performance wise some day). Having this as the default now (i.e. useDocValuesAsStored=true) would make the transition less abrupt. What do you think?

          Show
          Ishan Chattopadhyaya added a comment - David, do you think having the copyField targets to have useDocValuesAsStored as false in our example schemas partly alleviates the problem? In fact, I was planning to open another issue to add a few more dynamic field types that have stored=false, docValues=true in the example schemas. I can add this change to copyFields too, if it makes sense. Or, rather, do you prefer useDocValuesAsStored to be false by default and turned on on-demand? I think that will make adoption harder, and make it harder for us (or iow more of an abrupt change for user) to get rid of stored fields (even if it makes sense performance wise some day). Having this as the default now (i.e. useDocValuesAsStored=true) would make the transition less abrupt. What do you think?
          Hide
          David Smiley added a comment -

          Rule #5 is very surprising to me, as it seems to conflict with #3 in that it ignores useDocValuesAsStored. What is the rationale? If the field is both Stored And DV then from where is it returned? I REALLY hope not DV since it then wouldn't respect value ordering.

          Show
          David Smiley added a comment - Rule #5 is very surprising to me, as it seems to conflict with #3 in that it ignores useDocValuesAsStored. What is the rationale? If the field is both Stored And DV then from where is it returned? I REALLY hope not DV since it then wouldn't respect value ordering.
          Hide
          Shalin Shekhar Mangar added a comment -

          The idea is that if you have explicitly asked for that field then we find a way to return it, if possible. If it is both stored and DV then it returns from stored.

          Show
          Shalin Shekhar Mangar added a comment - The idea is that if you have explicitly asked for that field then we find a way to return it, if possible. If it is both stored and DV then it returns from stored.
          Hide
          Shalin Shekhar Mangar added a comment -

          It didn't have to be this way. It could have designed such that fl=* is only for stored fields and those explicitly setting useDocValuesAsStored=true.

          Let me take a step back. The way I think about useDocValuesAsStored is that it gives a hint to Solr that this field can be retrieved from DocValues. But that doesn't necessarily mean that we will always retrieve it from DocValues. Later, in SOLR-8344 we will implement some heuristics to choose whether to retrieve from stored or from DV if both are enabled.

          So, by making useDocValuesAsStored=true as default, we don't tie our hands in SOLR-8344 and we'd be free to choose as we desire. But if the user wants a field to never be automatically retrieved from DocValues then he can set useDocValuesAsStored=false. Keep in mind that this is only for automatic selection and if the user wants to explicitly retrieve a field by specifying its full name (no globbing) in the 'fl' parameter then we respect his/her wishes and retrieve from DV if necessary.

          Maybe useDocValuesAsStored is a bad name and 'autoDocValuesAsStored' conveys the meaning better?

          Show
          Shalin Shekhar Mangar added a comment - It didn't have to be this way. It could have designed such that fl=* is only for stored fields and those explicitly setting useDocValuesAsStored=true. Let me take a step back. The way I think about useDocValuesAsStored is that it gives a hint to Solr that this field can be retrieved from DocValues. But that doesn't necessarily mean that we will always retrieve it from DocValues. Later, in SOLR-8344 we will implement some heuristics to choose whether to retrieve from stored or from DV if both are enabled. So, by making useDocValuesAsStored=true as default, we don't tie our hands in SOLR-8344 and we'd be free to choose as we desire. But if the user wants a field to never be automatically retrieved from DocValues then he can set useDocValuesAsStored=false. Keep in mind that this is only for automatic selection and if the user wants to explicitly retrieve a field by specifying its full name (no globbing) in the 'fl' parameter then we respect his/her wishes and retrieve from DV if necessary. Maybe useDocValuesAsStored is a bad name and 'autoDocValuesAsStored' conveys the meaning better?
          Hide
          ASF subversion and git services added a comment -

          Commit 1721844 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1721844 ]

          SOLR-8220: Read field from DocValues for non stored fields

          Show
          ASF subversion and git services added a comment - Commit 1721844 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1721844 ] SOLR-8220 : Read field from DocValues for non stored fields
          Hide
          David Smiley added a comment -

          Ok then Rule #5 makes perfect sense to me. #3 is questionable to me; hard to clearly explain the rules to someone. So useDocValuesAsStored is only considered when FL contains an asterisk? If so, I wonder if it should be named to reflect that somehow. Like "flWildcard". With a name like that, it would be useful to set to false on a stored field.

          Show
          David Smiley added a comment - Ok then Rule #5 makes perfect sense to me. #3 is questionable to me; hard to clearly explain the rules to someone. So useDocValuesAsStored is only considered when FL contains an asterisk? If so, I wonder if it should be named to reflect that somehow. Like "flWildcard". With a name like that, it would be useful to set to false on a stored field.
          Hide
          Shalin Shekhar Mangar added a comment -

          #3 was fixed in later patches. In the committed code, if you request a field explicitly then it will be returned either from stored or from DV.

          Show
          Shalin Shekhar Mangar added a comment - #3 was fixed in later patches. In the committed code, if you request a field explicitly then it will be returned either from stored or from DV.
          Hide
          David Smiley added a comment -

          How about two Boolean settings:

          • matchesFlGlob defaults to whatever stored is. Very straight forward to describe; no special exceptions. Useful to set to either true or false in different circumstances depending if it's stored or DV. This would be a separate issue.
          • autoDocValuesAsStored defaults to whatever docValues is. I'm not sure how to define it honestly other than to say it doesn't have to do with fl globbing. Maybe it could prevent fl choosing the field even if explicitly ?
          Show
          David Smiley added a comment - How about two Boolean settings: matchesFlGlob defaults to whatever stored is. Very straight forward to describe; no special exceptions. Useful to set to either true or false in different circumstances depending if it's stored or DV. This would be a separate issue. autoDocValuesAsStored defaults to whatever docValues is. I'm not sure how to define it honestly other than to say it doesn't have to do with fl globbing. Maybe it could prevent fl choosing the field even if explicitly ?
          Hide
          Yonik Seeley added a comment -

          copyField target that isn't stored, isn't indexed, but has docValues for faceting or sorting.

          Going forward, why wouldn't one just use docValues on the original field?
          Anyway the copyField thing presents ambiguities for atomic updates as well... it's not specific to "fl". Whatever we support for that can be used for "fl" as well (for instance, determinging that the copyField target isn't a "real" field because the source is already stored/docValued or something).

          It seems like going forward, people will benefit by adjusting their mental model to "it's just a different way of storing... row stored or column stored." To a new user, why would one not get all stored fields back? If column stored option had been present in Lucene from the start, that's probably how it would have been implemented in Solr from the start.

          Show
          Yonik Seeley added a comment - copyField target that isn't stored, isn't indexed, but has docValues for faceting or sorting. Going forward, why wouldn't one just use docValues on the original field? Anyway the copyField thing presents ambiguities for atomic updates as well... it's not specific to "fl". Whatever we support for that can be used for "fl" as well (for instance, determinging that the copyField target isn't a "real" field because the source is already stored/docValued or something). It seems like going forward, people will benefit by adjusting their mental model to "it's just a different way of storing... row stored or column stored." To a new user, why would one not get all stored fields back? If column stored option had been present in Lucene from the start, that's probably how it would have been implemented in Solr from the start.
          Hide
          Shalin Shekhar Mangar added a comment -

          matchesFlGlob defaults to whatever stored is. Very straight forward to describe; no special exceptions. Useful to set to either true or false in different circumstances depending if it's stored or DV. This would be a separate issue.

          I don't think we need that level of configurability? Keep in mind that every parameter that we add also needs to be explained to a new user. Having trained people on Solr, explaining the difference between stored/indexed/docvalues is confusing enough. We should avoid adding more complexity if we can. This is another reason why I wanted useDocValuesAsStored to be true by default and un-specified/hidden in all default schemas.

          Also I don't really understand your reasons for adding such a parameter. Globbing is allowed on doc values field as long as useDocValuesAsStored=true. In the committed patch, you can do fl=a,b,c,d* and have d* match all docvalues fields which are useDocValuesAsStored=true. But if you set useDocValuesAsStored=false then globbing will not work. What am I missing?

          Show
          Shalin Shekhar Mangar added a comment - matchesFlGlob defaults to whatever stored is. Very straight forward to describe; no special exceptions. Useful to set to either true or false in different circumstances depending if it's stored or DV. This would be a separate issue. I don't think we need that level of configurability? Keep in mind that every parameter that we add also needs to be explained to a new user. Having trained people on Solr, explaining the difference between stored/indexed/docvalues is confusing enough. We should avoid adding more complexity if we can. This is another reason why I wanted useDocValuesAsStored to be true by default and un-specified/hidden in all default schemas. Also I don't really understand your reasons for adding such a parameter. Globbing is allowed on doc values field as long as useDocValuesAsStored=true . In the committed patch, you can do fl=a,b,c,d* and have d* match all docvalues fields which are useDocValuesAsStored=true . But if you set useDocValuesAsStored=false then globbing will not work. What am I missing?
          Hide
          Ishan Chattopadhyaya added a comment -

          Right, in the last patch from me (and maybe the second last too) [0], I introduced returning any additional fields with * even if those additional fields are non-stored DVs with useDocValuesAsStored=false. Hence, Shalin's point 3 now should read something like:

          3. fl=*,a1 will return all stored=true fields and useDocValuesAsStored=true DV fields. If the 'a1' field is a stored=false DV field with useDocValuesAsStored=false then it will also be returned because it was explicitly asked for.

          [0] - https://issues.apache.org/jira/browse/SOLR-8220?focusedCommentId=15071474&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15071474

          Show
          Ishan Chattopadhyaya added a comment - Right, in the last patch from me (and maybe the second last too) [0] , I introduced returning any additional fields with * even if those additional fields are non-stored DVs with useDocValuesAsStored=false. Hence, Shalin's point 3 now should read something like: 3. fl=*,a1 will return all stored=true fields and useDocValuesAsStored=true DV fields. If the 'a1' field is a stored=false DV field with useDocValuesAsStored=false then it will also be returned because it was explicitly asked for. [0] - https://issues.apache.org/jira/browse/SOLR-8220?focusedCommentId=15071474&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15071474
          Hide
          David Smiley added a comment -

          (Yonik) Going forward, why wouldn't one just use docValues on the original field?

          The main reason is that for returning the top-X docs with more than a few fields, row storage will probably be faster than columnar storage, performance-wise. So at index time you can pay for both if you need columnar for other reasons (sorting/faceting). Another reason is to retain a particular ordering for multi-valued fields. Another reason is that Solr's highlighters don't read from docValues (solvable).

          (Ishan) David, do you think having the copyField targets to have useDocValuesAsStored as false in our example schemas partly alleviates the problem?

          Yes, we should do that. Ideally most users wouldn't want to monkey with such parameters (IMO). But most schemas I've seen have at least one occurrence of an original input string indexed two ways for search & sorting/faceting. And if our example schemas do, thus motivating us to set it to false for these fields, it just re-confirms my point.

          Shalin: I very much care about ease of documenting/explaining this; I thought my comments showed I care. I guess we just see this issue differently. I'm coming around to a new interpretation of what useDocValuesAsStored is, as it was committed and today clarified by you and Ishan. It basically means will a 'fl' glob match the DV field or not. If my understanding is true, then I think this is evidence I'm on to something with my "matchesFlGlob" suggestion. You are free to disagree but I think it's extremely easy to document/describe/teach etc. what matchesFlGlob means, particularly if it's scope is expanded to apply to stored fields too.

          FWIW I'll be on IRC. My attempts to ping you haven't received a response.

          Show
          David Smiley added a comment - (Yonik) Going forward, why wouldn't one just use docValues on the original field? The main reason is that for returning the top-X docs with more than a few fields, row storage will probably be faster than columnar storage, performance-wise. So at index time you can pay for both if you need columnar for other reasons (sorting/faceting). Another reason is to retain a particular ordering for multi-valued fields. Another reason is that Solr's highlighters don't read from docValues (solvable). (Ishan) David, do you think having the copyField targets to have useDocValuesAsStored as false in our example schemas partly alleviates the problem? Yes, we should do that. Ideally most users wouldn't want to monkey with such parameters (IMO). But most schemas I've seen have at least one occurrence of an original input string indexed two ways for search & sorting/faceting. And if our example schemas do, thus motivating us to set it to false for these fields, it just re-confirms my point. Shalin: I very much care about ease of documenting/explaining this; I thought my comments showed I care. I guess we just see this issue differently. I'm coming around to a new interpretation of what useDocValuesAsStored is, as it was committed and today clarified by you and Ishan. It basically means will a 'fl' glob match the DV field or not. If my understanding is true, then I think this is evidence I'm on to something with my "matchesFlGlob" suggestion. You are free to disagree but I think it's extremely easy to document/describe/teach etc. what matchesFlGlob means, particularly if it's scope is expanded to apply to stored fields too. FWIW I'll be on IRC. My attempts to ping you haven't received a response.
          Hide
          Shalin Shekhar Mangar added a comment -

          I had a chat with David on IRC. In summary, his objections are:

          1. The primary is the default — are stored=false & docValues=true returned from globs by default (whatever we name the field/fieldtype attributes). It would be set to false often, assuming many schemas index text both ways?
          2. Globs only match a DV field if useDocValuesAsStored=true and so the property name should be indicative of that
          3. Perhaps if we have a matchesFlGlob parameter we would then have no need for useDocValuesAsStored?

          I was of the opinion that matchesFlGlob=true is not a great name because what if we use JSON request API in future and there's no "fl" so to speak. David then also suggested an alternate name e.g. matchesReturnGlob: whether a glob (asterisk) pattern in an ‘fl’ parameter (or equivalent) will match this field. Applies to stored or docValues fields. Defaults to what stored is set to.

          I'm still not sure about this, David. What does stored=true, matchesReturnGlob=false mean? Should that even be allowed?

          Replying to some of your earlier points:

          The main reason is that for returning the top-X docs with more than a few fields, row storage will probably be faster than columnar storage, performance-wise.

          Agreed, and that's why we need to figure out the specifics in SOLR-8344 so that we choose the right way of retrieval.

          Another reason is to retain a particular ordering for multi-valued fields

          We can make always try to use stored values for multi-valued fields in SOLR-8344.

          Another reason is that Solr's highlighters don't read from docValues (solvable).

          Why is this a reason for not using DocValues on the original field?

          Ideally most users wouldn't want to monkey with such parameters (IMO). But most schemas I've seen have at least one occurrence of an original input string indexed two ways for search & sorting/faceting. And if our example schemas do, thus motivating us to set it to false for these fields, it just re-confirms my point.

          The thing is the copy fields you refer to are deliberately added and I think it should be okay to expect that users will choose the value for useDocValuesAsStored according to their use-cases for such fields. It is also common for people to have docValues=true and stored=true together just because they believe that it isn't possible to retrieve doc values.

          What do you think?

          Show
          Shalin Shekhar Mangar added a comment - I had a chat with David on IRC. In summary, his objections are: The primary is the default — are stored=false & docValues=true returned from globs by default (whatever we name the field/fieldtype attributes). It would be set to false often, assuming many schemas index text both ways? Globs only match a DV field if useDocValuesAsStored=true and so the property name should be indicative of that Perhaps if we have a matchesFlGlob parameter we would then have no need for useDocValuesAsStored? I was of the opinion that matchesFlGlob=true is not a great name because what if we use JSON request API in future and there's no "fl" so to speak. David then also suggested an alternate name e.g. matchesReturnGlob: whether a glob (asterisk) pattern in an ‘fl’ parameter (or equivalent) will match this field. Applies to stored or docValues fields. Defaults to what stored is set to. I'm still not sure about this, David. What does stored=true, matchesReturnGlob=false mean? Should that even be allowed? Replying to some of your earlier points: The main reason is that for returning the top-X docs with more than a few fields, row storage will probably be faster than columnar storage, performance-wise. Agreed, and that's why we need to figure out the specifics in SOLR-8344 so that we choose the right way of retrieval. Another reason is to retain a particular ordering for multi-valued fields We can make always try to use stored values for multi-valued fields in SOLR-8344 . Another reason is that Solr's highlighters don't read from docValues (solvable). Why is this a reason for not using DocValues on the original field? Ideally most users wouldn't want to monkey with such parameters (IMO). But most schemas I've seen have at least one occurrence of an original input string indexed two ways for search & sorting/faceting. And if our example schemas do, thus motivating us to set it to false for these fields, it just re-confirms my point. The thing is the copy fields you refer to are deliberately added and I think it should be okay to expect that users will choose the value for useDocValuesAsStored according to their use-cases for such fields. It is also common for people to have docValues=true and stored=true together just because they believe that it isn't possible to retrieve doc values. What do you think?
          Hide
          Yonik Seeley added a comment -

          > (Yonik) Going forward, why wouldn't one just use docValues on the original field [ rather than a copyField ] ?
          The main reason is that for returning the top-X docs with more than a few fields, row storage will probably be faster than columnar storage, performance-wise.

          If one still wants to use row-stored for performance tweaks (the exact cross-over point will hopefully be determined by SOLR-8344), then they can still do that, and it makes sense to do both row-stored and column-stored on the original field. Sorry if it wasn't clear, but that was my original point.

          It [ useDocValuesAsStored ] basically means will a 'fl' glob match the DV field or not.

          That's only one aspect. Every place that "uses" stored fields should treat DV field as a stored field. It also affects streaming expressions / SQL, highlighting, partial updates, etc.

          Although I can see why you might think that at first - the fact that the current implementation does return fl=exact_field_name even when useDocValuesAsStored=false was a surprise to me as well. I don't see it as a big deal though (more like a minor syntactic shortcut for a pseudo field).

          To me, useDocValuesAsStored=false means "this isn't really a stored field... it's just an implementation artifact for some query-time feature we need".

          Show
          Yonik Seeley added a comment - > (Yonik) Going forward, why wouldn't one just use docValues on the original field [ rather than a copyField ] ? The main reason is that for returning the top-X docs with more than a few fields, row storage will probably be faster than columnar storage, performance-wise. If one still wants to use row-stored for performance tweaks (the exact cross-over point will hopefully be determined by SOLR-8344 ), then they can still do that, and it makes sense to do both row-stored and column-stored on the original field. Sorry if it wasn't clear, but that was my original point. It [ useDocValuesAsStored ] basically means will a 'fl' glob match the DV field or not. That's only one aspect. Every place that "uses" stored fields should treat DV field as a stored field. It also affects streaming expressions / SQL, highlighting, partial updates, etc. Although I can see why you might think that at first - the fact that the current implementation does return fl=exact_field_name even when useDocValuesAsStored=false was a surprise to me as well. I don't see it as a big deal though (more like a minor syntactic shortcut for a pseudo field). To me, useDocValuesAsStored=false means "this isn't really a stored field... it's just an implementation artifact for some query-time feature we need".
          Hide
          David Smiley added a comment -

          To be clear I'm not -1 on this; just -0.

          I get the gist of the intent with what is committed. Thanks for clarifying guys.

          I think what may reduce the need for users to know about / touch useDocValuesAsStored (what I hope for) is if conventionally, stored & doc-values fields are the same field, and then we put tokenized text into some other field, indexed=true stored=false docValues=false if we also need keyword search on the field in question. I think this is what Yonik is suggesting. The only annoyance with this is highlighting – the highlighters expect the stored text to be at the same field name as both the query & index. hl.requireFieldMatch could be set to false but that's a blunt instrument and isn't even supported by the postings highlighter. Nonetheless I think this could be solved in another issue.

          Can and should the default/example schemas be adjusted to fit the aforementioned conventions? Then we wouldn't have any want to set useDocValuesAsStored in them.

          Show
          David Smiley added a comment - To be clear I'm not -1 on this; just -0. I get the gist of the intent with what is committed. Thanks for clarifying guys. I think what may reduce the need for users to know about / touch useDocValuesAsStored (what I hope for) is if conventionally, stored & doc-values fields are the same field, and then we put tokenized text into some other field, indexed=true stored=false docValues=false if we also need keyword search on the field in question. I think this is what Yonik is suggesting. The only annoyance with this is highlighting – the highlighters expect the stored text to be at the same field name as both the query & index. hl.requireFieldMatch could be set to false but that's a blunt instrument and isn't even supported by the postings highlighter. Nonetheless I think this could be solved in another issue. Can and should the default/example schemas be adjusted to fit the aforementioned conventions? Then we wouldn't have any want to set useDocValuesAsStored in them.
          Hide
          Erick Erickson added a comment -

          Hmmm, one implication that I just realized (yeah, I'm slow sometimes). In addition to the output from a DV field being reordered, identical values are collapsed since it's a SortedSet under the covers, right? So if I have a MultiValued DocValues field and put in "memory", "memory", "memory", then all I get back from the DV field is one copy of "memory".

          Since the guarantee that multiValued fields return data in the same order inserted when returning the Stored value is not true of returning DV values, this is perhaps no biggie. IMO it does deserve a comment when we do the docs for this though.

          Show
          Erick Erickson added a comment - Hmmm, one implication that I just realized (yeah, I'm slow sometimes). In addition to the output from a DV field being reordered, identical values are collapsed since it's a SortedSet under the covers, right? So if I have a MultiValued DocValues field and put in "memory", "memory", "memory", then all I get back from the DV field is one copy of "memory". Since the guarantee that multiValued fields return data in the same order inserted when returning the Stored value is not true of returning DV values, this is perhaps no biggie. IMO it does deserve a comment when we do the docs for this though.
          Hide
          Yonik Seeley added a comment -

          Yep... I commented earlier that we should prob add an explicit "set" type/flag in the future. It can make sense to have a field that preserves order/instances and one that doesn't.

          Show
          Yonik Seeley added a comment - Yep... I commented earlier that we should prob add an explicit "set" type/flag in the future. It can make sense to have a field that preserves order/instances and one that doesn't.
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          Notes for the reference guide:

          Page: https://cwiki.apache.org/confluence/display/solr/Defining+Fields

          Property=useDocValuesAsStored
          Description=If the field has docValues enabled, setting this to true would allow the field to be treated as regular stored fields (even if it has stored=false). This means that this field would be returned alongside regular stored fields that are returned using the fl parameter.
          Values=true or false
          Implicit default=false for schema versions <1.6, true for schema versions >=1.6
          

          Page: https://cwiki.apache.org/confluence/display/solr/DocValues

          <New section> Retrieving docValues during search:
          
          Field values retrieved during search queries are typically returned from stored values. However, starting with schema version 1.6, all non-stored docValues fields will be also returned along with other stored fields when all fields (or pattern matching globs) are specified to be returned (e.g. fl=*) for search queries. This behavior can be turned on and off by setting useDocValuesAsStored parameter for a field or a field type to true (implicit default since schema version 1.6) or false (implicit default till schema version 1.5). See https://cwiki.apache.org/confluence/display/solr/Defining+Fields
          
          Note that enabling this property has performance implications because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. Also note that while returning non-stored fields from docValues (default in schema versions 1.6+, unless useDocValuesAsStored is false), the values of a multi-valued field are returned in sorted order (and not insertion order). If you require the multi-valued fields to be returned in the original insertion order, then make your multi-valued field as stored (such a change requires re-indexing).
          

          Page: https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefl%28FieldList%29Parameter

          Note: Starting with schema version 1.6, if there are non-stored fields with docValues enabled in the index, then a pattern glob like * in the fl parameter will retrieve those fields. This is not the case if those fields have explicitly useDocValuesAsStored as false in their field definition (see https://cwiki.apache.org/confluence/display/solr/Defining+Fields) or the schema version is <1.6. However, something like fl=dvfield or fl=*,dvfield (say dvfield is a non-stored field with docValues enabled) would retrieve the dvfield irrespective of the useDocValuesAsStored value. (See SOLR-8220 for more details)
          

          Could someone please review and update the ref guide with the above information? And please feel free to reorganize, modify, drop, or rephrase any of this.

          Show
          Ishan Chattopadhyaya added a comment - - edited Notes for the reference guide: Page: https://cwiki.apache.org/confluence/display/solr/Defining+Fields Property=useDocValuesAsStored Description=If the field has docValues enabled, setting this to true would allow the field to be treated as regular stored fields (even if it has stored= false ). This means that this field would be returned alongside regular stored fields that are returned using the fl parameter. Values= true or false Implicit default = false for schema versions <1.6, true for schema versions >=1.6 Page: https://cwiki.apache.org/confluence/display/solr/DocValues <New section> Retrieving docValues during search: Field values retrieved during search queries are typically returned from stored values. However, starting with schema version 1.6, all non-stored docValues fields will be also returned along with other stored fields when all fields (or pattern matching globs) are specified to be returned (e.g. fl=*) for search queries. This behavior can be turned on and off by setting useDocValuesAsStored parameter for a field or a field type to true (implicit default since schema version 1.6) or false (implicit default till schema version 1.5). See https: //cwiki.apache.org/confluence/display/solr/Defining+Fields Note that enabling this property has performance implications because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. Also note that while returning non-stored fields from docValues ( default in schema versions 1.6+, unless useDocValuesAsStored is false ), the values of a multi-valued field are returned in sorted order (and not insertion order). If you require the multi-valued fields to be returned in the original insertion order, then make your multi-valued field as stored (such a change requires re-indexing). Page: https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefl%28FieldList%29Parameter Note: Starting with schema version 1.6, if there are non-stored fields with docValues enabled in the index, then a pattern glob like * in the fl parameter will retrieve those fields. This is not the case if those fields have explicitly useDocValuesAsStored as false in their field definition (see https: //cwiki.apache.org/confluence/display/solr/Defining+Fields) or the schema version is <1.6. However, something like fl=dvfield or fl=*,dvfield (say dvfield is a non-stored field with docValues enabled) would retrieve the dvfield irrespective of the useDocValuesAsStored value. (See SOLR-8220 for more details) Could someone please review and update the ref guide with the above information? And please feel free to reorganize, modify, drop, or rephrase any of this.
          Hide
          Shalin Shekhar Mangar added a comment -

          This was released in 5.5 so I am resolving this.

          David Smiley – please open new issues if you want to change this feature for 6.0 in any way.

          Show
          Shalin Shekhar Mangar added a comment - This was released in 5.5 so I am resolving this. David Smiley – please open new issues if you want to change this feature for 6.0 in any way.

            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Keith Laban
            • Votes:
              2 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development