Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-8344

Decide default when requested fields are both column and row stored.

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (8.0), 7.1
    • Component/s: None
    • Labels:
      None

      Description

      This issue was discussed in the comments at SOLR-8220. Splitting it out to a separate issue so that we can have a focused discussion on whether/how to do this.

      If a given set of requested fields are all stored and have docValues (column stored), we can retrieve the values from either place. What should the default be?

      1. SOLR-8344.patch
        16 kB
        Cao Manh Dat
      2. SOLR-8344.patch
        16 kB
        Cao Manh Dat
      3. SOLR-8344.patch
        16 kB
        Cao Manh Dat

        Issue Links

          Activity

          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          But are you also arguing for always loading fields from docvalues even if they are stored?

          If a client requests fl=a,b,c (and these three fields all have docvalues and are stored), it may be slower using docvalues if they are not cached yet.
          The question then becomes.... why are they not cached?

          • this is a one-off query, the docValues are not normally used
            • this is a case we should not be optimizing too much for
          • this is going to be a very common query
            • in this case, we should use docvalues anyway.... the average latency will drop as things get cached.

          If we're requesting a large result set, it probably makes sense to use docvalues.... every cache miss brings in 4K of that column, so subsequent accesses will become less likely to miss (vs the same scenario in stored fields). If the sort is by _docid_ then access will even be linear, meaning there will be few cache misses. OS read-ahead being triggered will reduce that even further.

          If the index is so massive that the docvalues for these three fields can't be cached for the random access case, then how will docvalues compare to stored values?
          With a disk-seek-per-doc-access, this is going to be a slow system regardless, and very specialized (i.e. if one can't effectively cache these fields, then things like sorting/faceting on these fields will be slow as well).

          Based on what we know now, it feels like docValues is the right default.
          Benchmarking to verify our assumptions would be a good thing.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - But are you also arguing for always loading fields from docvalues even if they are stored? If a client requests fl=a,b,c (and these three fields all have docvalues and are stored), it may be slower using docvalues if they are not cached yet. The question then becomes.... why are they not cached? this is a one-off query, the docValues are not normally used this is a case we should not be optimizing too much for this is going to be a very common query in this case, we should use docvalues anyway.... the average latency will drop as things get cached. If we're requesting a large result set, it probably makes sense to use docvalues.... every cache miss brings in 4K of that column, so subsequent accesses will become less likely to miss (vs the same scenario in stored fields). If the sort is by _docid_ then access will even be linear, meaning there will be few cache misses. OS read-ahead being triggered will reduce that even further. If the index is so massive that the docvalues for these three fields can't be cached for the random access case, then how will docvalues compare to stored values? With a disk-seek-per-doc-access, this is going to be a slow system regardless, and very specialized (i.e. if one can't effectively cache these fields, then things like sorting/faceting on these fields will be slow as well). Based on what we know now, it feels like docValues is the right default. Benchmarking to verify our assumptions would be a good thing.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Also highly related is what our schema defaults should be.

          For example, the fields we add as docValue by default (say dynamic fields like *_i), should they also be row-stored (stored="true") or not?

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Also highly related is what our schema defaults should be. For example, the fields we add as docValue by default (say dynamic fields like *_i), should they also be row-stored (stored="true") or not?
          Hide
          erickerickson Erick Erickson added a comment -

          Moving my comment from SOLR-8220 here, didn't see this ticket before pasting my comment there:

          My initial take would be something like add a new possibility to stored, something like stored="true|false|docvalues".

          No guessing involved then. stored="docvalues" would NOT put anything in the fdt file for this field. We should probably error out if stored="docvalues" and docValues="false".

          If it was important to get 2, not 1.999999999999 then there would need to be a stored=true variant of the field.

          **************Another take...
          Should we consider disallowing setting both? I.e. stored="true" docValues="true" throws an error...

          Now it's totally under the control of the user which path is chosen through the schema definition; we don't have to try to guess anything. No new syntax either. Maybe with a new "best practice" or something. There would be a learning curve for users around using only docValues=true for efficiency and not setting stored=true.

          That said, I suspect that the break in back-compat is too expensive and I think stored=docvalues is better anyway....

          Show
          erickerickson Erick Erickson added a comment - Moving my comment from SOLR-8220 here, didn't see this ticket before pasting my comment there: My initial take would be something like add a new possibility to stored, something like stored="true|false|docvalues". No guessing involved then. stored="docvalues" would NOT put anything in the fdt file for this field. We should probably error out if stored="docvalues" and docValues="false". If it was important to get 2, not 1.999999999999 then there would need to be a stored=true variant of the field. **************Another take... Should we consider disallowing setting both? I.e. stored="true" docValues="true" throws an error... Now it's totally under the control of the user which path is chosen through the schema definition; we don't have to try to guess anything. No new syntax either. Maybe with a new "best practice" or something. There would be a learning curve for users around using only docValues=true for efficiency and not setting stored=true. That said, I suspect that the break in back-compat is too expensive and I think stored=docvalues is better anyway....
          Hide
          yseeley@gmail.com Yonik Seeley added a comment - - edited

          Not quite sure what to do if the user defines both however, perhaps use the stored value?

          That's what this issue is about.

          Why are we trying to anticipate "the right thing to do"?

          They are both "right" (semantically equivalent), so we're trying to figure out the best performing option. That will change based on the request, and always (statically) choosing one over the other is never going to be optimal.

          Example: If a field is column-stored (docValued), one could chose to add it as stored so that it would be quicker to retrieve along with other non-docvalue fields (if you need to go to stored fields anyway, just get all the field values you can from there). But when accessing that field alone, we'd still certainly want to use the column... it will be much faster.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - - edited Not quite sure what to do if the user defines both however, perhaps use the stored value? That's what this issue is about. Why are we trying to anticipate "the right thing to do"? They are both "right" (semantically equivalent), so we're trying to figure out the best performing option. That will change based on the request, and always (statically) choosing one over the other is never going to be optimal. Example: If a field is column-stored (docValued), one could chose to add it as stored so that it would be quicker to retrieve along with other non-docvalue fields (if you need to go to stored fields anyway, just get all the field values you can from there). But when accessing that field alone, we'd still certainly want to use the column... it will be much faster.
          Hide
          erickerickson Erick Erickson added a comment -

          Yeah, removed that all from 8220, thought about it more and saw this ticket...

          We do want to keep in mind though that from a user perspective, inputting '2.178734872984723984729387" as a float and getting back "2.17873477935791" are not equivalent. So whatever we do needs to take that into account. If we use the stored=docvalues approach, then they'd need a copyField with stored=true field if it was important to them.

          The other thing I like about the stored=docvalues approach, existing schemas/applications continue to work exactly as they do now. Upgrading is not surprising. Users can selectively take advantage of this new capability.

          I think it'd even work if a user changed an existing schema from stored=true to stored=docvalues and did not re-index. True, they'd be carrying around some useless data in the *.fdt files until it was merged away for updated docs, but as long as docValues=true was set in the original schema, it'd "do the expected thing".

          I suppose the same functionality could come from a new parameter on a <field> definition like fl_from_dv=true but aside from the awkward name, I think this is harder to wrap your head around.

          If you were willing to pay the "wasted disk space" penalty though, a separate parameter on the <field> to control where the "stored" value came from would allow one to switch back and forth without re-indexing. But to be clear, I don't think this is a good idea; using the stored=docvalues option forces a decision to be made up-front.

          Show
          erickerickson Erick Erickson added a comment - Yeah, removed that all from 8220, thought about it more and saw this ticket... We do want to keep in mind though that from a user perspective, inputting '2.178734872984723984729387" as a float and getting back "2.17873477935791" are not equivalent. So whatever we do needs to take that into account. If we use the stored=docvalues approach, then they'd need a copyField with stored=true field if it was important to them. The other thing I like about the stored=docvalues approach, existing schemas/applications continue to work exactly as they do now. Upgrading is not surprising. Users can selectively take advantage of this new capability. I think it'd even work if a user changed an existing schema from stored=true to stored=docvalues and did not re-index. True, they'd be carrying around some useless data in the *.fdt files until it was merged away for updated docs, but as long as docValues=true was set in the original schema, it'd "do the expected thing". I suppose the same functionality could come from a new parameter on a <field> definition like fl_from_dv=true but aside from the awkward name, I think this is harder to wrap your head around. If you were willing to pay the "wasted disk space" penalty though, a separate parameter on the <field> to control where the "stored" value came from would allow one to switch back and forth without re-indexing. But to be clear, I don't think this is a good idea; using the stored=docvalues option forces a decision to be made up-front.
          Hide
          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          Erick Erickson As of now, users got the stored value directly, but the dv value by using field(). If we load stored fields also from docValues, do you think adding a function stored(), which would actually go and retrieve the row stored value, solve this usecase?

          Show
          ichattopadhyaya Ishan Chattopadhyaya added a comment - Erick Erickson As of now, users got the stored value directly, but the dv value by using field(). If we load stored fields also from docValues, do you think adding a function stored(), which would actually go and retrieve the row stored value, solve this usecase?
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          We do want to keep in mind though that from a user perspective, inputting '2.178734872984723984729387" as a float and getting back "2.17873477935791"

          Right. Numeric values are already stored in binary form, so any truncation that may occur is already occurring (and should occur in an identical way for docvalues).

          Show
          yseeley@gmail.com Yonik Seeley added a comment - We do want to keep in mind though that from a user perspective, inputting '2.178734872984723984729387" as a float and getting back "2.17873477935791" Right. Numeric values are already stored in binary form, so any truncation that may occur is already occurring (and should occur in an identical way for docvalues).
          Hide
          erickerickson Erick Erickson added a comment -

          Ishan Chattopadhyaya I think it'd solve the use-case. and I'm +1 on adding stored().

          I'm also concerned with the upgrade case though. If someone takes their 5x schema and app (and I'm assuming all this is 6.0 at the earliest) and does nothing to their schema or app, I'd like them to see no change. If we require them to use stored() to pull the stored value explicitly, the app needs to change or the users will be surprised.

          Which is why I kinda like the stored=docvalues option. It would mean that they get the same behavior they see now if they do nothing, but the option of doing things the new way without re-indexing.

          Hmmm, I like also adding the stored() option. That would allow someone to take an existing schema/index/app, add stored=docvalues option (for fields that are already DV fields) and still get the raw stored value if they want.

          Hmmmmm2. the more I think about it, the more I think that it's probably better to have something like a new option. Rather than stored=docvalues, something like defaultStoredReturn=docvalues|stored. That would allow an existing index with a DV field to have the maximum flexibility. Plus it would allow a single field definition to serve all the use cases. My original idea of stored=docvalues wouldn't put anything in the fdt files when indexing, so a stored() operation wouldn't work on a new index.

          Hmmmmm3. Maybe I'm worrying too much about flexibility. We already have enough trouble explaining the nuances of stored/indexed/docValues. Adding defaultStoredReturn to the mix makes it really ugly. Just using stored=docvalues allows people to make a decision, and the back-compat story is simple. "To take advantage of these efficiencies, you may have to reindex if you care about the edge case of floats/doubles/ints/longs (and maybe dates)".

          Show
          erickerickson Erick Erickson added a comment - Ishan Chattopadhyaya I think it'd solve the use-case. and I'm +1 on adding stored(). I'm also concerned with the upgrade case though. If someone takes their 5x schema and app (and I'm assuming all this is 6.0 at the earliest) and does nothing to their schema or app, I'd like them to see no change. If we require them to use stored() to pull the stored value explicitly, the app needs to change or the users will be surprised. Which is why I kinda like the stored=docvalues option. It would mean that they get the same behavior they see now if they do nothing, but the option of doing things the new way without re-indexing. Hmmm, I like also adding the stored() option. That would allow someone to take an existing schema/index/app, add stored=docvalues option (for fields that are already DV fields) and still get the raw stored value if they want. Hmmmmm2. the more I think about it, the more I think that it's probably better to have something like a new option. Rather than stored=docvalues, something like defaultStoredReturn=docvalues|stored. That would allow an existing index with a DV field to have the maximum flexibility. Plus it would allow a single field definition to serve all the use cases. My original idea of stored=docvalues wouldn't put anything in the fdt files when indexing, so a stored() operation wouldn't work on a new index. Hmmmmm3. Maybe I'm worrying too much about flexibility. We already have enough trouble explaining the nuances of stored/indexed/docValues. Adding defaultStoredReturn to the mix makes it really ugly. Just using stored=docvalues allows people to make a decision, and the back-compat story is simple. "To take advantage of these efficiencies, you may have to reindex if you care about the edge case of floats/doubles/ints/longs (and maybe dates)".
          Hide
          erickerickson Erick Erickson added a comment - - edited

          Damn, ya' learn something new every day.

          Golly, I need to think before I type. We get the transformed date back from the stored field.

          So never mind the whole "they may not get back what they put in" discussion, it's misguided.

          Show
          erickerickson Erick Erickson added a comment - - edited Damn, ya' learn something new every day. Golly, I need to think before I type. We get the transformed date back from the stored field. So never mind the whole "they may not get back what they put in" discussion, it's misguided.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          I'm also concerned with the upgrade case though. If someone takes their 5x schema and app (and I'm assuming all this is 6.0 at the earliest) and does nothing to their schema or app, I'd like them to see no change.

          The plan from SOLR-8220 was to use the schema version for back compat... meaning that if you have unstored docValues fields, they won't suddenly start appearing when you do fl=*

          Show
          yseeley@gmail.com Yonik Seeley added a comment - I'm also concerned with the upgrade case though. If someone takes their 5x schema and app (and I'm assuming all this is 6.0 at the earliest) and does nothing to their schema or app, I'd like them to see no change. The plan from SOLR-8220 was to use the schema version for back compat... meaning that if you have unstored docValues fields, they won't suddenly start appearing when you do fl=*
          Hide
          erickerickson Erick Erickson added a comment -

          What is the status of this now that SOLR-8220 has been committed? Still under development?

          Show
          erickerickson Erick Erickson added a comment - What is the status of this now that SOLR-8220 has been committed? Still under development?
          Hide
          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          Erick, I'll take a fresh look at this post 6.0 release, and attempt an initial patch, unless someone else wants to take it up.

          Show
          ichattopadhyaya Ishan Chattopadhyaya added a comment - Erick, I'll take a fresh look at this post 6.0 release, and attempt an initial patch, unless someone else wants to take it up.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Related: in my testing, it seemed like retrieving docvalue fields that were unstored was much slower than the /export handler... and it wasn't the sorting part, this was sustained slowness, pointing to our retrieval code.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Related: in my testing, it seemed like retrieving docvalue fields that were unstored was much slower than the /export handler... and it wasn't the sorting part, this was sustained slowness, pointing to our retrieval code.
          Hide
          erickerickson Erick Erickson added a comment -

          Is it possible to preferentially return DV when doing the first pass of a distributed search? IIRC, Yonik said at one point that we already get the sort values from the index (or perhaps we can pull them from DV fields). But I just verified that the logic in DocStreamer decompresses the doc pretty much no matter what. I've just verified this in 6.2. The test for whether all the fields are DV fields fails due to the presence of the "score" pseudo field.

          I added a hack patch to 6810 to illustrate one path that sidesteps decompression.

          Anyway, I was wondering if we could somehow detect that this was the first pass and return the DV fields all the time, although I do wonder if some obscure case where the sorted order of the multiValued DV fields would mess things up.

          Show
          erickerickson Erick Erickson added a comment - Is it possible to preferentially return DV when doing the first pass of a distributed search? IIRC, Yonik said at one point that we already get the sort values from the index (or perhaps we can pull them from DV fields). But I just verified that the logic in DocStreamer decompresses the doc pretty much no matter what. I've just verified this in 6.2. The test for whether all the fields are DV fields fails due to the presence of the "score" pseudo field. I added a hack patch to 6810 to illustrate one path that sidesteps decompression. Anyway, I was wondering if we could somehow detect that this was the first pass and return the DV fields all the time, although I do wonder if some obscure case where the sorted order of the multiValued DV fields would mess things up.
          Hide
          dsmiley David Smiley added a comment - - edited

          Is it possible to preferentially return DV when doing the first pass of a distributed search?

          Yes with customizations. I have a patch in https://issues.apache.org/jira/browse/SOLR-5478 which is used by one of my clients, and I have another deployed implementation in which a custom request handler extending SearchHandler rewrites the id field in fl to be field(id) in certain circumstances, and that didn't require patching Solr. I'd love to see SOLR-5478 get finally put to bed. Oh and I wrote tests. Hmmm.

          Show
          dsmiley David Smiley added a comment - - edited Is it possible to preferentially return DV when doing the first pass of a distributed search? Yes with customizations. I have a patch in https://issues.apache.org/jira/browse/SOLR-5478 which is used by one of my clients, and I have another deployed implementation in which a custom request handler extending SearchHandler rewrites the id field in fl to be field(id) in certain circumstances, and that didn't require patching Solr. I'd love to see SOLR-5478 get finally put to bed. Oh and I wrote tests. Hmmm.
          Hide
          dsmiley David Smiley added a comment -

          This issue (SOLR-8344) seems to be about a decision, not about actual code. Linking to SOLR-5478 which contains a patch that actually does this.

          Show
          dsmiley David Smiley added a comment - This issue ( SOLR-8344 ) seems to be about a decision, not about actual code. Linking to SOLR-5478 which contains a patch that actually does this.
          Hide
          caomanhdat Cao Manh Dat added a comment -

          I think instead of find out the default solution. Why don't we do a quick estimation like this

          dvEstimate = estimateReadingTimeForDocValuesAsDefault(List<String> fields, int numDocs);
          storedEstimate = estimateReadingTimeForStoredAsDefault(List<String> fields, int numDocs);
          

          By comparing dvEstimate and storedEstimate we can choose to use dv or stored field here.

          Show
          caomanhdat Cao Manh Dat added a comment - I think instead of find out the default solution. Why don't we do a quick estimation like this dvEstimate = estimateReadingTimeForDocValuesAsDefault(List< String > fields, int numDocs); storedEstimate = estimateReadingTimeForStoredAsDefault(List< String > fields, int numDocs); By comparing dvEstimate and storedEstimate we can choose to use dv or stored field here.
          Hide
          caomanhdat Cao Manh Dat added a comment - - edited

          Here are my patch for this ticket. The work here is simple. I created a new class called RetrieveFieldsOptimizer, It takes input from docFetcher, returnFields and docList, then return storedFields and docValuesFields need to retrieve.

          The optimization here is very simple ( we can do more optimization in the future if we want ). In case of field has both stored and docValues, we always use docValues unless the numDocs is small and exist a field need to be returned but only stored.

          Therefore in first pass ( with id and score fields only ) we will always use docValues to retrieve id field.

          Here are some benchmark result ( 3 shards, 1 replica each, SSD drive, 18000 docs, run 500 times and get average )

          Query AVG QTime with optimizer AVG QTime without optimizer
          q=:&fl=title_str_dv_stored,id,revision_text_str_dv_stored&start=10000 54 172
          q=:&fl=title_str_dv_stored,id,revision_text_str_dv_stored&start=10 2 2
          q=:&fl=title_str_dv_stored,id,revision_text_str_dv_stored&rows=1000 40 64
          q=:&fl=title_str_dv_stored,id,revision_text_str_stored&start=10000 53 175
          q=:&fl=title_str_dv_stored,id,revision_text_str_stored&start=10 2 2
          q=:&fl=title_str_dv_stored,id,revision_text_str_stored&rows=1000 56 64
          Show
          caomanhdat Cao Manh Dat added a comment - - edited Here are my patch for this ticket. The work here is simple. I created a new class called RetrieveFieldsOptimizer , It takes input from docFetcher, returnFields and docList, then return storedFields and docValuesFields need to retrieve. The optimization here is very simple ( we can do more optimization in the future if we want ). In case of field has both stored and docValues, we always use docValues unless the numDocs is small and exist a field need to be returned but only stored. Therefore in first pass ( with id and score fields only ) we will always use docValues to retrieve id field. Here are some benchmark result ( 3 shards, 1 replica each, SSD drive, 18000 docs, run 500 times and get average ) Query AVG QTime with optimizer AVG QTime without optimizer q= : &fl=title_str_dv_stored,id,revision_text_str_dv_stored&start=10000 54 172 q= : &fl=title_str_dv_stored,id,revision_text_str_dv_stored&start=10 2 2 q= : &fl=title_str_dv_stored,id,revision_text_str_dv_stored&rows=1000 40 64 q= : &fl=title_str_dv_stored,id,revision_text_str_stored&start=10000 53 175 q= : &fl=title_str_dv_stored,id,revision_text_str_stored&start=10 2 2 q= : &fl=title_str_dv_stored,id,revision_text_str_stored&rows=1000 56 64
          Hide
          tomasflobbe Tomás Fernández Löbbe added a comment -

          Interesting! Two questions:

          q=:&fl=TITLE_str,id,REVISION_TEXT_str&start=10000

          Did you mean "rows=10000" or are you actually changing start?
          What happens in your benchmark if at least one of the fields to return is a text field (or has docValues=false)? Does getting the stored content of one of the fields change your benchmark results?

          Show
          tomasflobbe Tomás Fernández Löbbe added a comment - Interesting! Two questions: q=:&fl=TITLE_str,id,REVISION_TEXT_str&start=10000 Did you mean "rows=10000" or are you actually changing start? What happens in your benchmark if at least one of the fields to return is a text field (or has docValues=false)? Does getting the stored content of one of the fields change your benchmark results?
          Hide
          caomanhdat Cao Manh Dat added a comment -

          Tomás Fernández Löbbe I changed start param. Therefore, in the first pass, every shard will send back 10010 ids back to the query master. Right now we will use stored field to retrieve doc ids, with this patch, we will use doc values field to retrieve doc ids ( which is significant faster ).
          In case of less values to be retrieved ( case 2 ), the different is not much.

          What happens in your benchmark if at least one of the fields to return is a text field (or has docValues=false)? Does getting the stored content of one of the fields change your benchmark results?

          Yeah, I'm about to testing that, we post result latter. But keep in mind that, in case of multi-value field, we can't use docValues because values are stored as sorted-set.

          Show
          caomanhdat Cao Manh Dat added a comment - Tomás Fernández Löbbe I changed start param. Therefore, in the first pass, every shard will send back 10010 ids back to the query master. Right now we will use stored field to retrieve doc ids, with this patch, we will use doc values field to retrieve doc ids ( which is significant faster ). In case of less values to be retrieved ( case 2 ), the different is not much. What happens in your benchmark if at least one of the fields to return is a text field (or has docValues=false)? Does getting the stored content of one of the fields change your benchmark results? Yeah, I'm about to testing that, we post result latter. But keep in mind that, in case of multi-value field, we can't use docValues because values are stored as sorted-set.
          Hide
          dsmiley David Smiley added a comment -

          I don't think the number of documents is that pertinent since both strategies are O(docs) multiplied by other factors. Remember that testing this is also complicated by the existence of the Solr document cache. Why not have the strategy simply observe if it's possible at all to completely use docValues. If it is, do it, if not, don't. If the "fl" contains a wildcard, we may want to give up on using docValues altogether.

          Show
          dsmiley David Smiley added a comment - I don't think the number of documents is that pertinent since both strategies are O(docs) multiplied by other factors. Remember that testing this is also complicated by the existence of the Solr document cache. Why not have the strategy simply observe if it's possible at all to completely use docValues. If it is, do it, if not, don't. If the "fl" contains a wildcard, we may want to give up on using docValues altogether.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Why not have the strategy simply observe if it's possible at all to completely use docValues.

          +1

          If the "fl" contains a wildcard, we may want to give up on using docValues

          Or resolve wildcards first.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Why not have the strategy simply observe if it's possible at all to completely use docValues. +1 If the "fl" contains a wildcard, we may want to give up on using docValues Or resolve wildcards first.
          Hide
          caomanhdat Cao Manh Dat added a comment - - edited

          David Smiley

          I don't think the number of documents is that pertinent since both strategies are O(docs) multiplied by other factors.

          No, It matter in case of "fl" contains a field which is stored only (let's call it field1) and "rows" is small. Because we already have to pay for seek cost for reading field1 therefore reading other fields from stored will be faster than reading from DV. I tested this result with Lucene (not Solr) so we don't have to worry about caching will affect the result here.

          Show
          caomanhdat Cao Manh Dat added a comment - - edited David Smiley I don't think the number of documents is that pertinent since both strategies are O(docs) multiplied by other factors. No, It matter in case of "fl" contains a field which is stored only (let's call it field1) and "rows" is small. Because we already have to pay for seek cost for reading field1 therefore reading other fields from stored will be faster than reading from DV. I tested this result with Lucene (not Solr) so we don't have to worry about caching will affect the result here.
          Hide
          dsmiley David Smiley added a comment -

          No, It matter in case of "fl" contains a field which is stored only (let's call it field1) and "rows" is small.

          Which scenario is this in your benchmark above? Or some other benchmark you allude to?

          Because we already have to pay for seek cost for reading field1 therefore reading other fields from stored will be faster than reading from DV

          I understand that but don't see what that has to do with the number of documents.

          In the event every field in "fl" has docValues, stored, single valued: will your patch use docValues data always? If not, why not? Maybe I should study your patch further but I was a bit confused.

          I would like to consider the relationship of this optimization on the document cache, if not in this patch then perhaps in a follow-up. If in some request we realize we can avoid accessing the stored document and instead do only with docValues, we should avoid polluting the document cache as well, I think. Maybe we will Consider the first phase of distributed search that only wants the uniqueKey field.

          In your patch, this: if (doc.containsKey(fieldName)) doc.remove(fieldName); can be simplified to remove the needless condition

          Show
          dsmiley David Smiley added a comment - No, It matter in case of "fl" contains a field which is stored only (let's call it field1) and "rows" is small. Which scenario is this in your benchmark above? Or some other benchmark you allude to? Because we already have to pay for seek cost for reading field1 therefore reading other fields from stored will be faster than reading from DV I understand that but don't see what that has to do with the number of documents. In the event every field in "fl" has docValues, stored, single valued: will your patch use docValues data always? If not, why not? Maybe I should study your patch further but I was a bit confused. I would like to consider the relationship of this optimization on the document cache, if not in this patch then perhaps in a follow-up. If in some request we realize we can avoid accessing the stored document and instead do only with docValues, we should avoid polluting the document cache as well, I think. Maybe we will Consider the first phase of distributed search that only wants the uniqueKey field. In your patch, this: if (doc.containsKey(fieldName)) doc.remove(fieldName); can be simplified to remove the needless condition
          Hide
          caomanhdat Cao Manh Dat added a comment - - edited

          Which scenario is this in your benchmark above? Or some other benchmark you allude to?

          Yeah, it belongs to another simple benchmark.

          I understand that but don't see what that has to do with the number of documents.

          Hmm, I re-run the benchmark and you're right. They are almost the same.

          In the event every field in "fl" has docValues, stored, single valued: will your patch use docValues data always

          Yeah, It always use docValues.

          If in some request we realize we can avoid accessing the stored document and instead do only with docValues, we should avoid polluting the document cache as well

          With this patch, if we notice that we do not need stored field, so we will skip reading the document as well as caching.

          In your patch, this: if (doc.containsKey(fieldName)) doc.remove(fieldName); can be simplified to remove the needless condition

          I don't know what you means, but if we remove this line, some tests will get failed ( TestPseudoReturnFields )

          Show
          caomanhdat Cao Manh Dat added a comment - - edited Which scenario is this in your benchmark above? Or some other benchmark you allude to? Yeah, it belongs to another simple benchmark. I understand that but don't see what that has to do with the number of documents. Hmm, I re-run the benchmark and you're right. They are almost the same. In the event every field in "fl" has docValues, stored, single valued: will your patch use docValues data always Yeah, It always use docValues. If in some request we realize we can avoid accessing the stored document and instead do only with docValues, we should avoid polluting the document cache as well With this patch, if we notice that we do not need stored field, so we will skip reading the document as well as caching. In your patch, this: if (doc.containsKey(fieldName)) doc.remove(fieldName); can be simplified to remove the needless condition I don't know what you means, but if we remove this line, some tests will get failed ( TestPseudoReturnFields )
          Hide
          dsmiley David Smiley added a comment -

          I mean just call {{doc.remove(fieldName)}; no need to guard this with an "if". If it's not there, doc.remove is harmless.

          Show
          dsmiley David Smiley added a comment - I mean just call {{doc.remove(fieldName)}; no need to guard this with an "if". If it's not there, doc.remove is harmless.
          Hide
          caomanhdat Cao Manh Dat added a comment -

          Ah yeah, you're right.

          Show
          caomanhdat Cao Manh Dat added a comment - Ah yeah, you're right.
          Hide
          caomanhdat Cao Manh Dat added a comment - - edited

          BTW I think that for this ticket, the current optimization is good enough ( we can remove numRows check before doing commit ). And we can always change the optimize method latter.

          Show
          caomanhdat Cao Manh Dat added a comment - - edited BTW I think that for this ticket, the current optimization is good enough ( we can remove numRows check before doing commit ). And we can always change the optimize method latter.
          Hide
          tomasflobbe Tomás Fernández Löbbe added a comment -

          Because we already have to pay for seek cost for reading field1 therefore reading other fields from stored will be faster than reading from DV

          That would be my guess. But then, wouldn't we prefer to use stored fields for all fields (if possible) if at least one of the fields needs to come from stored fields? In the patch, it looks like, if rows >= 100, then we try to get everything from DVs when possible, right?

          Show
          tomasflobbe Tomás Fernández Löbbe added a comment - Because we already have to pay for seek cost for reading field1 therefore reading other fields from stored will be faster than reading from DV That would be my guess. But then, wouldn't we prefer to use stored fields for all fields (if possible) if at least one of the fields needs to come from stored fields? In the patch, it looks like, if rows >= 100, then we try to get everything from DVs when possible, right?
          Hide
          caomanhdat Cao Manh Dat added a comment -

          Tomás Fernández Löbbe Yeah in my benchmark for Lucene, they will almost the same, so I would like to read from docValues, therefore docValues fields can be cached.

          Show
          caomanhdat Cao Manh Dat added a comment - Tomás Fernández Löbbe Yeah in my benchmark for Lucene, they will almost the same, so I would like to read from docValues, therefore docValues fields can be cached.
          Hide
          dsmiley David Smiley added a comment -

          Please post a new patch so we can see before you commit it. Furthermore, can you please include a simple summary of the approach in the javadoc of the optimizer.... something like (assuming this is accurate):

          Sometimes we could fetch a field value from either the stored document or docValues. Such fields have both and are single-valued. If choosing docValues allows us to avoid accessing the stored document altogether for all fields to be returned then we do it, otherwise we prefer the stored value when we have a choice.

          IMO An effective strategy and straight-forward to explain.

          Show
          dsmiley David Smiley added a comment - Please post a new patch so we can see before you commit it. Furthermore, can you please include a simple summary of the approach in the javadoc of the optimizer.... something like (assuming this is accurate): Sometimes we could fetch a field value from either the stored document or docValues. Such fields have both and are single-valued. If choosing docValues allows us to avoid accessing the stored document altogether for all fields to be returned then we do it, otherwise we prefer the stored value when we have a choice. IMO An effective strategy and straight-forward to explain.
          Hide
          thetaphi Uwe Schindler added a comment -

          I'd always use docValues for the ID field (if available) for the first pass of a cloud/distributed search. I had a customer with huge stored fields and the first pass of the distributed search got a major slowdown. Only workaround was to use single-pass, which had other problems (of course).

          So I'd suggest to add the following special case: If first pass of distributed search AND id field has docvalues => also use docValues.

          Show
          thetaphi Uwe Schindler added a comment - I'd always use docValues for the ID field (if available) for the first pass of a cloud/distributed search. I had a customer with huge stored fields and the first pass of the distributed search got a major slowdown. Only workaround was to use single-pass, which had other problems (of course). So I'd suggest to add the following special case: If first pass of distributed search AND id field has docvalues => also use docValues.
          Hide
          dsmiley David Smiley added a comment -

          Uwe Schindler it isn't necessary to have a special condition for distributed search. As long as the uniqueKey field has docValues, the algorithm I've been promoting here will choose it. Can you see that from my proposed documentation comment above? If not can you suggest improved wording?

          Show
          dsmiley David Smiley added a comment - Uwe Schindler it isn't necessary to have a special condition for distributed search. As long as the uniqueKey field has docValues, the algorithm I've been promoting here will choose it. Can you see that from my proposed documentation comment above? If not can you suggest improved wording?
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          A related thing worth thinking about: what should the default be for our docValue fields in our schema templates / examples?
          IIRC, they are both stored and docValues=true currently, which was a conservative decision since we didn't know the impact of trying to use docValues only for stored fields. That extra space may not be worth it in the average case though.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - A related thing worth thinking about: what should the default be for our docValue fields in our schema templates / examples? IIRC, they are both stored and docValues=true currently, which was a conservative decision since we didn't know the impact of trying to use docValues only for stored fields. That extra space may not be worth it in the average case though.
          Hide
          thetaphi Uwe Schindler added a comment -

          For the Id field we figured out that it breaks some stuff if stored is missing. E.g., More like this breaks. IMHO this is a bug, because MLT should use the doc cache / searcher to get the MLT values including id.

          Show
          thetaphi Uwe Schindler added a comment - For the Id field we figured out that it breaks some stuff if stored is missing. E.g., More like this breaks. IMHO this is a bug, because MLT should use the doc cache / searcher to get the MLT values including id.
          Hide
          caomanhdat Cao Manh Dat added a comment -

          IIRC, they are both stored and docValues=true currently, which was a conservative decision since we didn't know the impact of trying to use docValues only for stored fields. That extra space may not be worth it in the average case though.

          Yeah, I think in case of single-value field, we should use docValues=true only.

          Show
          caomanhdat Cao Manh Dat added a comment - IIRC, they are both stored and docValues=true currently, which was a conservative decision since we didn't know the impact of trying to use docValues only for stored fields. That extra space may not be worth it in the average case though. Yeah, I think in case of single-value field, we should use docValues=true only.
          Hide
          caomanhdat Cao Manh Dat added a comment -

          Updated patch based on your hint David Smiley

          Show
          caomanhdat Cao Manh Dat added a comment - Updated patch based on your hint David Smiley
          Hide
          dsmiley David Smiley added a comment -
          • SolrDocumentFetcher: minor: allStoreds variables & getter should simply be allStored (no trailing 's' which reads weird)
          • RetrieveFieldsOptimizer.storedFields add comment saying "null" means all available
          • RetrieveFieldsOptimizer.calcStoredFieldsForReturn: in the hasPatternMatching condition check: eagerly initialize storedFields to a new HashSet instead of lazy. If it turns out that no stored fields match, we don't want to act like we want all stored fields, which is what I think will happen with your logic. To clarify calcStoredFieldsForReturn further, I suggest making storedFields final, and return null early for wantsAllFields. (IMO final fields can help clarify non-trivial logic)
          • RetrieveFieldsOptimizer.optimize: couldn't the loop here be replaced with a dvFields.addAll(storedFields); storedFields.clear(); ?
          • minor: It's debatable to me that RetrieveFieldsOptimizer is its own class. It seems to me that a few static methods on DocsStreamer would be fine? (package access for testing). No big deal though.
          • do we really need to make this setting toggle-able with a new parameter? I suppose you left this in for performance testing but I think it can be removed now.
          • should we consider the presence of useDocValuesAsStored=false in one of the fields that are both stored and docValues as a signal that we should not do this optimization? I suppose so.

          I think I see an existing bug (not introduced here) in the logic you moved – calcDocValueFieldsForReturn. If fl=foo*,dvField and dvField has useDocValuesAsStored=false then the code won't return dvField even though it's been explicitly mentioned. I haven't tried this; I'm just reading the code carefully.

          Show
          dsmiley David Smiley added a comment - SolrDocumentFetcher: minor: allStoreds variables & getter should simply be allStored (no trailing 's' which reads weird) RetrieveFieldsOptimizer.storedFields add comment saying "null" means all available RetrieveFieldsOptimizer.calcStoredFieldsForReturn: in the hasPatternMatching condition check: eagerly initialize storedFields to a new HashSet instead of lazy. If it turns out that no stored fields match, we don't want to act like we want all stored fields, which is what I think will happen with your logic. To clarify calcStoredFieldsForReturn further, I suggest making storedFields final, and return null early for wantsAllFields. (IMO final fields can help clarify non-trivial logic) RetrieveFieldsOptimizer.optimize: couldn't the loop here be replaced with a dvFields.addAll(storedFields); storedFields.clear(); ? minor: It's debatable to me that RetrieveFieldsOptimizer is its own class. It seems to me that a few static methods on DocsStreamer would be fine? (package access for testing). No big deal though. do we really need to make this setting toggle-able with a new parameter? I suppose you left this in for performance testing but I think it can be removed now. should we consider the presence of useDocValuesAsStored=false in one of the fields that are both stored and docValues as a signal that we should not do this optimization? I suppose so. I think I see an existing bug (not introduced here) in the logic you moved – calcDocValueFieldsForReturn. If fl=foo*,dvField and dvField has useDocValuesAsStored=false then the code won't return dvField even though it's been explicitly mentioned. I haven't tried this; I'm just reading the code carefully.
          Hide
          caomanhdat Cao Manh Dat added a comment - - edited

          Thanks for your review, that's great. But

          minor: It's debatable to me that RetrieveFieldsOptimizer is its own class. It seems to me that a few static methods on DocsStreamer would be fine? (package access for testing). No big deal though.

          I think the purpose of the class is clear to me and this make the code look better

          should we consider the presence of useDocValuesAsStored=false in one of the fields that are both stored and docValues as a signal that we should not do this optimization? I suppose so.

          I maybe misunderstanding but useDocValuesAsStored=true only relate to fl=* case and in that case, storedFields==null which is already skipped in the patch.

          Show
          caomanhdat Cao Manh Dat added a comment - - edited Thanks for your review, that's great. But minor: It's debatable to me that RetrieveFieldsOptimizer is its own class. It seems to me that a few static methods on DocsStreamer would be fine? (package access for testing). No big deal though. I think the purpose of the class is clear to me and this make the code look better should we consider the presence of useDocValuesAsStored=false in one of the fields that are both stored and docValues as a signal that we should not do this optimization? I suppose so. I maybe misunderstanding but useDocValuesAsStored=true only relate to fl=* case and in that case, storedFields==null which is already skipped in the patch.
          Hide
          caomanhdat Cao Manh Dat added a comment -

          Updated patch for this ticket ( thanks David Smiley the patch looks much better ). I will run the test and do commit soon.

          Show
          caomanhdat Cao Manh Dat added a comment - Updated patch for this ticket ( thanks David Smiley the patch looks much better ). I will run the test and do commit soon.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 40f78dd2740122e5fa18f2515effc9272fd2d902 in lucene-solr's branch refs/heads/master from Cao Manh Dat
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=40f78dd ]

          SOLR-8344: Decide default when requested fields are both column and row stored.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 40f78dd2740122e5fa18f2515effc9272fd2d902 in lucene-solr's branch refs/heads/master from Cao Manh Dat [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=40f78dd ] SOLR-8344 : Decide default when requested fields are both column and row stored.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit d877ade945c5b143dcd063e2865ab65368710b92 in lucene-solr's branch refs/heads/branch_7x from Cao Manh Dat
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d877ade ]

          SOLR-8344: Decide default when requested fields are both column and row stored.

          Show
          jira-bot ASF subversion and git services added a comment - Commit d877ade945c5b143dcd063e2865ab65368710b92 in lucene-solr's branch refs/heads/branch_7x from Cao Manh Dat [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d877ade ] SOLR-8344 : Decide default when requested fields are both column and row stored.
          Hide
          caomanhdat Cao Manh Dat added a comment -

          I think I see an existing bug (not introduced here) in the logic you moved – calcDocValueFieldsForReturn. If fl=foo*,dvField and dvField has useDocValuesAsStored=false then the code won't return dvField even though it's been explicitly mentioned. I haven't tried this; I'm just reading the code carefully.

          David Smiley This seems a bug to me, too. I will spin off this issue into another ticket.

          Show
          caomanhdat Cao Manh Dat added a comment - I think I see an existing bug (not introduced here) in the logic you moved – calcDocValueFieldsForReturn. If fl=foo*,dvField and dvField has useDocValuesAsStored=false then the code won't return dvField even though it's been explicitly mentioned. I haven't tried this; I'm just reading the code carefully. David Smiley This seems a bug to me, too. I will spin off this issue into another ticket.

            People

            • Assignee:
              caomanhdat Cao Manh Dat
              Reporter:
              ichattopadhyaya Ishan Chattopadhyaya
            • Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development