Solr
  1. Solr
  2. SOLR-5670

_version_ either indexed OR docvalue

    Details

      Description

      As far as I can see there is no good reason to require that "version" field has to be indexed if it is docvalued. So I guess it will be ok with a rule saying "version has to be either indexed or docvalue (allowed to be both)".

      1. SOLR-5670.patch
        1 kB
        Shawn Heisey
      2. SOLR-5670.patch
        1 kB
        Per Steffensen

        Activity

        Hide
        Per Steffensen added a comment - - edited

        Simple patch attached. No testes of it added, but I have seen it working locally. 4.4.0 test-suite is green with this change. Do not know if branch_4x test-suite is.

        Show
        Per Steffensen added a comment - - edited Simple patch attached. No testes of it added, but I have seen it working locally. 4.4.0 test-suite is green with this change. Do not know if branch_4x test-suite is.
        Hide
        Shawn Heisey added a comment - - edited

        From a design perspective, I can't claim to know whether this is an acceptable patch or not. Consistency in configs across multiple users and multiple versions does have some value, which is a very minor argument against this change.

        Is there any benchmark data? If docValues provides better performance for _version_ than indexed when it is used for its intended purpose, it might be worth changing the example config ... but people should know that if they do change the config on this field, they will have to completely reindex.

        This patch is functionally identical to the previous one, it just modifies an error message. I didn't check to see what branch Per's patch was created on, but it did apply cleanly to branch_4x. This patch is against that branch.

        Show
        Shawn Heisey added a comment - - edited From a design perspective, I can't claim to know whether this is an acceptable patch or not. Consistency in configs across multiple users and multiple versions does have some value, which is a very minor argument against this change. Is there any benchmark data? If docValues provides better performance for _version_ than indexed when it is used for its intended purpose, it might be worth changing the example config ... but people should know that if they do change the config on this field, they will have to completely reindex. This patch is functionally identical to the previous one, it just modifies an error message. I didn't check to see what branch Per's patch was created on, but it did apply cleanly to branch_4x. This patch is against that branch.
        Hide
        Per Steffensen added a comment -

        Is there any benchmark data? If docValues provides better performance for version than indexed

        I do not think it will in most cases.

        • Indexed: When you want to get the version for a particular doc-no (found by id), you can make a lookup in FieldCache holding the reversed term-index - this is in memory and constant time. If you have a very rapidly changing data-set (so that FieldCache-entries will be invalidated often due to merging) you might get better performance (response-time) with doc-values - but not in general, I think.
        • DocValues: You will read the version from doc-values which in not necessarily in memory

        We are prepared to take a small performance hit, to avoid having all that data in FieldCache. In general we do not allow putting anything in FieldCache, because we have so many documents, that is always creates issues with too much memory usage. The problem with FieldCache is that it is all or nothing - for a good reasons! - we just cannot live with it.

        We havnt made the change on version (going from indexed to doc-value) in production yet. We will do some performance testing on it first, and depending on how much we decide to do, I can get back with some numbers.

        when it is used for its intended purpose, it might be worth changing the example config

        Do not think you should do that. Using FieldCache is probably the best "default". But writing something somewhere about the option of using doc-values instead of indexed, and when that is a good idea, would be nice.

        ... but people should know that if they do change the config on this field, they will have to completely reindex.

        Or just start using it from now on in new collections. We create a new collection every month and keep a history of data by keeping the "latest" 24 collections. One of many reasons for doing this, is that it provides us the option of changing indexing-strategy etc every month. For us re-indexing is completely out of the question - we have billions and billions of records in Solr and re-indexing them all in a fairly short service-window is not possible. Therefore we built this new-collection-every-month thingy in order to have some flexibility.

        This patch is functionally identical to the previous one, it just modifies an error message.

        Nicely spotted

        I didn't check to see what branch Per's patch was created on, but it did apply cleanly to branch_4x.

        It was branch_4x

        Show
        Per Steffensen added a comment - Is there any benchmark data? If docValues provides better performance for version than indexed I do not think it will in most cases. Indexed: When you want to get the version for a particular doc-no (found by id), you can make a lookup in FieldCache holding the reversed term-index - this is in memory and constant time. If you have a very rapidly changing data-set (so that FieldCache-entries will be invalidated often due to merging) you might get better performance (response-time) with doc-values - but not in general, I think. DocValues: You will read the version from doc-values which in not necessarily in memory We are prepared to take a small performance hit, to avoid having all that data in FieldCache. In general we do not allow putting anything in FieldCache, because we have so many documents, that is always creates issues with too much memory usage. The problem with FieldCache is that it is all or nothing - for a good reasons! - we just cannot live with it. We havnt made the change on version (going from indexed to doc-value) in production yet. We will do some performance testing on it first, and depending on how much we decide to do, I can get back with some numbers. when it is used for its intended purpose, it might be worth changing the example config Do not think you should do that. Using FieldCache is probably the best "default". But writing something somewhere about the option of using doc-values instead of indexed, and when that is a good idea, would be nice. ... but people should know that if they do change the config on this field, they will have to completely reindex. Or just start using it from now on in new collections. We create a new collection every month and keep a history of data by keeping the "latest" 24 collections. One of many reasons for doing this, is that it provides us the option of changing indexing-strategy etc every month. For us re-indexing is completely out of the question - we have billions and billions of records in Solr and re-indexing them all in a fairly short service-window is not possible. Therefore we built this new-collection-every-month thingy in order to have some flexibility. This patch is functionally identical to the previous one, it just modifies an error message. Nicely spotted I didn't check to see what branch Per's patch was created on, but it did apply cleanly to branch_4x. It was branch_4x
        Hide
        Shawn Heisey added a comment -

        Reducing heap requirements by not requiring data to go into the FieldCache is a major win for huge indexes. GC can be a major source of performance issues even if you've got garbage collection superbly tuned, and I doubt that my tuning parameters are perfect.

        Show
        Shawn Heisey added a comment - Reducing heap requirements by not requiring data to go into the FieldCache is a major win for huge indexes. GC can be a major source of performance issues even if you've got garbage collection superbly tuned, and I doubt that my tuning parameters are perfect.
        Hide
        Yonik Seeley added a comment -

        Committed. Thanks!

        Show
        Yonik Seeley added a comment - Committed. Thanks!
        Hide
        Per Steffensen added a comment -

        Thanks, Yonik Seeley

        Show
        Per Steffensen added a comment - Thanks, Yonik Seeley
        Hide
        Yago Riveiro added a comment -

        should be the wiki ref updated with this info?

        This is a minor change, but when we are creating the schema, if we will leverage the docvalues feature, this kind of configurations can matter.

        Show
        Yago Riveiro added a comment - should be the wiki ref updated with this info? This is a minor change, but when we are creating the schema, if we will leverage the docvalues feature, this kind of configurations can matter.
        Hide
        Per Steffensen added a comment -

        Well docValues are not even mentioned on http://wiki.apache.org/solr/SchemaXml so I guess that is not the place to make the updated documentation. Guess it has been replaced by something else? If someone can point me to a place where a few lines of documentation should be added, I will be happy to do it.

        Show
        Per Steffensen added a comment - Well docValues are not even mentioned on http://wiki.apache.org/solr/SchemaXml so I guess that is not the place to make the updated documentation. Guess it has been replaced by something else? If someone can point me to a place where a few lines of documentation should be added, I will be happy to do it.
        Hide
        Yago Riveiro added a comment - - edited

        The Solr guide is here Solr Reference Guide

        Show
        Yago Riveiro added a comment - - edited The Solr guide is here Solr Reference Guide
        Hide
        Per Steffensen added a comment -

        I am not allowed to edit that, but I made a quick comment, saying something about where and how documentation could be added. If that is the right place and if you want to use my exact sentences is up to the editors of this Reference Guide.

        https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents?focusedCommentId=38571510#comment-38571510

        Show
        Per Steffensen added a comment - I am not allowed to edit that, but I made a quick comment, saying something about where and how documentation could be added. If that is the right place and if you want to use my exact sentences is up to the editors of this Reference Guide. https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents?focusedCommentId=38571510#comment-38571510
        Hide
        Jessica Cheng Mallet added a comment -

        Hi Per Steffensen, do you have any sense of at what document size per core does using DocValues for version start being more beneficial?

        Show
        Jessica Cheng Mallet added a comment - Hi Per Steffensen , do you have any sense of at what document size per core does using DocValues for version start being more beneficial?
        Hide
        Per Steffensen added a comment -

        Well, no, I have not.

        Our reason for doing this is not necessarily to make things faster. It basically boils down to avoiding OOMs. When you want to get a value for a particular field on a particular document in Solr/Lucene there are several places to find the information. You can find it in the store (if stored=true for that field), which is fairly slow. You can also find it in the FieldCache, which is an in-memory doc-id to field-value map. Problem with FieldCache is that if you load it, you need to load values for ALL the documents in the index (or at least the segment). If you have enormous amounts of documents in your index this can cause OOM, because you simple cannot have all those values in memory. Then doc-value is for the rescue. Doc-value basically holds the same doc-id to field-value data-structure as the FieldCache, but doc-values are maintained continuously in additional files in your index-folder, so you can just go read the particular value you need, hence avoiding OOM. FieldCache is something you "calculate" into memory, based on data form store or index.

        Show
        Per Steffensen added a comment - Well, no, I have not. Our reason for doing this is not necessarily to make things faster. It basically boils down to avoiding OOMs. When you want to get a value for a particular field on a particular document in Solr/Lucene there are several places to find the information. You can find it in the store (if stored=true for that field), which is fairly slow. You can also find it in the FieldCache, which is an in-memory doc-id to field-value map. Problem with FieldCache is that if you load it, you need to load values for ALL the documents in the index (or at least the segment). If you have enormous amounts of documents in your index this can cause OOM, because you simple cannot have all those values in memory. Then doc-value is for the rescue. Doc-value basically holds the same doc-id to field-value data-structure as the FieldCache, but doc-values are maintained continuously in additional files in your index-folder, so you can just go read the particular value you need, hence avoiding OOM. FieldCache is something you "calculate" into memory, based on data form store or index.

          People

          • Assignee:
            Per Steffensen
            Reporter:
            Per Steffensen
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development