Solr
  1. Solr
  2. SOLR-1782

stats.facet assumes FieldCache.StringIndex - fails horribly on multivalued fields

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: search
    • Labels:
      None
    • Environment:

      reproduced on Win2k3 using 1.5.0-dev solr ($Id: CHANGES.txt 906924 2010-02-05 12:43:11Z noble $)

      Description

      the StatsComponent assumes any field specified in the stats.facet param can be faceted using FieldCache.DEFAULT.getStringIndex. This can cause problems with a variety of field types, but in the case of multivalued fields it can either cause erroneous false stats when the number of distinct values is small, or it can cause ArrayIndexOutOfBoundsException when the number of distinct values is greater then the number of documents.

      1. index.rar
        83 kB
        Gerald DeConto
      2. SOLR-1782.2.patch
        12 kB
        Wojtek Piaseczny
      3. SOLR-1782.2013-01-07.patch
        16 kB
        David Christianson
      4. SOLR-1782.2013-04-10.patch
        21 kB
        Steven Bower
      5. SOLR-1782.patch
        21 kB
        Hoss Man
      6. SOLR-1782.patch
        15 kB
        David Christianson
      7. SOLR-1782.patch
        12 kB
        Wojtek Piaseczny
      8. SOLR-1782.solr-4.2.1.patch
        21 kB
        Patanachai Tangchaisin
      9. SOLR-1782.test.patch
        5 kB
        Hoss Man

        Issue Links

          Activity

          Hide
          Gerald DeConto added a comment - - edited

          file contains readme.txt, sample data, solrconfig.xml, data-config.xml and schema.xml

          addtional info (ie results from running test can be found at http://old.nabble.com/getting-unexpected-statscomponent-values-td27599248.html

          Show
          Gerald DeConto added a comment - - edited file contains readme.txt, sample data, solrconfig.xml, data-config.xml and schema.xml addtional info (ie results from running test can be found at http://old.nabble.com/getting-unexpected-statscomponent-values-td27599248.html
          Hide
          Hoss Man added a comment -

          I'm pokign around the attached RAR file now, and two interesting things jump out at me:

          First: StatsFacetField is multivalied. every doc has exactly 3 values, but on some cases the value is repeated...

          <doc>
           <arr name="StatsFacetField">
            <int>2</int>
            <int>3</int>
            <int>1</int>
           </arr>
           <int name="ValueOfOneField">1</int>
           <int name="id">7631</int>
          </doc>
          <doc>
           <arr name="StatsFacetField">
            <int>3</int>
            <int>3</int>
            <int>1</int>
           </arr>
           <int name="ValueOfOneField">1</int>
           <int name="id">7453</int>
          </doc>
          

          Second: the stat facet counts produced by this sample index have the (conicidental?) property that the sum of all the "counts" from each value of the StatsFacetField equals the total number of docs – which should not be the case since each doc contains multiple values. (Note: the output from Gerald's initial email didn't demonstrate this, but the index included in the rar file is inconsistent with his initial email in other ways, so i believe this one was generated with slightly differnet configs)

          I think this suggests that the the bug in StatsComponent is triggered when the stats.facet field refers to a multivalued field. I'm going to see if i can create a simple JUnit test.

          Show
          Hoss Man added a comment - I'm pokign around the attached RAR file now, and two interesting things jump out at me: First: StatsFacetField is multivalied. every doc has exactly 3 values, but on some cases the value is repeated... <doc> <arr name="StatsFacetField"> <int>2</int> <int>3</int> <int>1</int> </arr> <int name="ValueOfOneField">1</int> <int name="id">7631</int> </doc> <doc> <arr name="StatsFacetField"> <int>3</int> <int>3</int> <int>1</int> </arr> <int name="ValueOfOneField">1</int> <int name="id">7453</int> </doc> Second: the stat facet counts produced by this sample index have the (conicidental?) property that the sum of all the "counts" from each value of the StatsFacetField equals the total number of docs – which should not be the case since each doc contains multiple values. (Note: the output from Gerald's initial email didn't demonstrate this, but the index included in the rar file is inconsistent with his initial email in other ways, so i believe this one was generated with slightly differnet configs) I think this suggests that the the bug in StatsComponent is triggered when the stats.facet field refers to a multivalued field. I'm going to see if i can create a simple JUnit test.
          Hide
          Hoss Man added a comment -

          Updating issue summary and description based on the root cause

          Show
          Hoss Man added a comment - Updating issue summary and description based on the root cause
          Hide
          Hoss Man added a comment -

          patch containing two new test methods that demonstrate this bug – one creates an index with a single document (and two values) to trigger the AIOOBE, the second adds many docs with a small number of distinct values, and demonstrates the incorrect stats facet values.

          Off the top of my head i don't know a simple way to fix this ... if no one else has any better suggestions, we should consider adding Schema based error checking to fail fast if the user attempts to facet on fields which are not single valued string fields – some other field types might work, but i'd rather fail more often then we have to then return bogus statistics.

          Show
          Hoss Man added a comment - patch containing two new test methods that demonstrate this bug – one creates an index with a single document (and two values) to trigger the AIOOBE, the second adds many docs with a small number of distinct values, and demonstrates the incorrect stats facet values. Off the top of my head i don't know a simple way to fix this ... if no one else has any better suggestions, we should consider adding Schema based error checking to fail fast if the user attempts to facet on fields which are not single valued string fields – some other field types might work, but i'd rather fail more often then we have to then return bogus statistics.
          Hide
          Hoss Man added a comment -

          Crux of the problem is in FieldFacetStats's dependency on using StringIndex ... at least two different code paths use this in the same (broken) way: Stats.Component.getFieldCacheStats and UnInvertedField.getStats

          Show
          Hoss Man added a comment - Crux of the problem is in FieldFacetStats's dependency on using StringIndex ... at least two different code paths use this in the same (broken) way: Stats.Component.getFieldCacheStats and UnInvertedField.getStats
          Hide
          Wojtek Piaseczny added a comment -

          I'd like to contribute to solving this issue, but I'm not sure if I'm going down the right path. Here are the possible solutions I see:

          1. Use UninvertedField for multi-valued facets in the StatsComponent. This would require a new method in UninvertedField: something like getValues(int docID). The problem with this is the big terms collection in UninvertedField... getting all values for a single document via big terms is expensive (have to iterate entire collection).
          2. Get facet values for the result set in the StatsComponent, then iterate through each value and get a new document set for each value, then iterate through each document in this new set and calculate stats. Sounds expensive.

          Are there better options?

          Show
          Wojtek Piaseczny added a comment - I'd like to contribute to solving this issue, but I'm not sure if I'm going down the right path. Here are the possible solutions I see: 1. Use UninvertedField for multi-valued facets in the StatsComponent. This would require a new method in UninvertedField: something like getValues(int docID). The problem with this is the big terms collection in UninvertedField... getting all values for a single document via big terms is expensive (have to iterate entire collection). 2. Get facet values for the result set in the StatsComponent, then iterate through each value and get a new document set for each value, then iterate through each document in this new set and calculate stats. Sounds expensive. Are there better options?
          Hide
          Wojtek Piaseczny added a comment -

          Added getValues method to UninvertedField class. Might be very slow with many big terms (I happen to have 0 big terms on the fields I'll be using this on).

          Both code paths that use StringIndex now use UnivertedField is multi-valued fields are detected.

          Show
          Wojtek Piaseczny added a comment - Added getValues method to UninvertedField class. Might be very slow with many big terms (I happen to have 0 big terms on the fields I'll be using this on). Both code paths that use StringIndex now use UnivertedField is multi-valued fields are detected.
          Hide
          Wojtek Piaseczny added a comment - - edited

          First patch was unusably slow with ~1M documents. New patch uses both UninvertedField and FieldCache.DocTermsIndex for multi-valued facet fields in StatsComponent. getValues renamed to getTermNumbers to reflect the change.

          Show
          Wojtek Piaseczny added a comment - - edited First patch was unusably slow with ~1M documents. New patch uses both UninvertedField and FieldCache.DocTermsIndex for multi-valued facet fields in StatsComponent. getValues renamed to getTermNumbers to reflect the change.
          Hide
          Anarkii added a comment -

          Is there any update on this issue? Or if Wojtek's fix would be merged into the trunk? I'm trying to do a stats.facet on a field which contains multiple tokens, and have the same issue resulting out of getStringIndex. Say, I'm doing a stats.facet on a field "tags", which can contain multiple tokens...only one of the tags is being considered.

          Show
          Anarkii added a comment - Is there any update on this issue? Or if Wojtek's fix would be merged into the trunk? I'm trying to do a stats.facet on a field which contains multiple tokens, and have the same issue resulting out of getStringIndex. Say, I'm doing a stats.facet on a field "tags", which can contain multiple tokens...only one of the tags is being considered.
          Hide
          biofox added a comment - - edited

          Wojtek and Hoss Man - are you planning to release the changes as part of 1.4.2 or 4.0 ?

          Show
          biofox added a comment - - edited Wojtek and Hoss Man - are you planning to release the changes as part of 1.4.2 or 4.0 ?
          Hide
          biofox added a comment -

          I have applied patch SOLR-1782.2.patch to the latest trunk version (revision 1055053). I have loaded test data and compared the stats results with 1.4.1 release version results. The new stat facet counts and sum match the expected counts for multi-valued fields.

          Show
          biofox added a comment - I have applied patch SOLR-1782 .2.patch to the latest trunk version (revision 1055053). I have loaded test data and compared the stats results with 1.4.1 release version results. The new stat facet counts and sum match the expected counts for multi-valued fields.
          Hide
          David Christianson added a comment -

          Is a fix anticipated? Already in the next version which we should just download? Other issues?

          We are currently running 1.4.1 (955469) and are seeing this problem for some applications. The patch given does not apply out of the box to 1.4.1 (955469) without a few tweaks but so far appears to work.

          Show
          David Christianson added a comment - Is a fix anticipated? Already in the next version which we should just download? Other issues? We are currently running 1.4.1 (955469) and are seeing this problem for some applications. The patch given does not apply out of the box to 1.4.1 (955469) without a few tweaks but so far appears to work.
          Hide
          David Christianson added a comment - - edited

          I have a real need for this functionality and had been using the supplied patch on older distributions. However in more recent builds several classes changed and there was an exception added if you tried to write a stats query using a multivalued facet. So I've taken the code in the original patch and updated it for Solr 4.0, removing the exception and adding a couple of unit tests.

          Show
          David Christianson added a comment - - edited I have a real need for this functionality and had been using the supplied patch on older distributions. However in more recent builds several classes changed and there was an exception added if you tried to write a stats query using a multivalued facet. So I've taken the code in the original patch and updated it for Solr 4.0, removing the exception and adding a couple of unit tests.
          Hide
          Mark Miller added a comment -

          Assigned to hossman in case he has forgotten this issue. He can unassign himself if he doesnt have time.

          Show
          Mark Miller added a comment - Assigned to hossman in case he has forgotten this issue. He can unassign himself if he doesnt have time.
          Hide
          David Christianson added a comment -

          New patch, found bug using Solr 4 patch in practice where calculating statistics over missing values caused an exception to be thrown. Expanded test case to cover this + also check multiple value types.

          Show
          David Christianson added a comment - New patch, found bug using Solr 4 patch in practice where calculating statistics over missing values caused an exception to be thrown. Expanded test case to cover this + also check multiple value types.
          Hide
          Hoss Man added a comment -

          Wojtek & David...

          I spent some time today trying to look over the latest patch ( SOLR-1782.2013-01-07.patch ) and had some trouble wrapping my head arround it.

          In particular i think the crux of my confusion stems from this comment from Wojtek...

          New patch uses both UninvertedField and FieldCache.DocTermsIndex for multi-valued facet fields

          ...which doens't really make sense to me.

          If a field is multivalued, then (unless i'm missunderstanding something) a DocTermsIndex won't make sense for that field at all. At best it's wasted RAM to build up the DocTermsIndex when UnInvertedField is going to be used instead – but in this patch, as part of the "if (null == uif)" sprinkled throughout FieldFacetStats, it looks like term ordinals from the UnInvertedField.getTermNumbers method are then being used to look Term values from the DocTermsIndex ... i don't understand that at all.

          Hopefully this is just a minor bug in the patch, and the call to "si.lookup" in the UIF block(s) of FieldFacetStats.facet & FieldFacetStats.facetTermNum are ment to be calls to some similar method in UnivertedField, but it's not entirely clear.

          If that is the case, then it suggests to me that a better way to organize this code so that it's more clear what's going on (and to save all that wasted RAM used by the DocTermsIndex when going down the UIF code path!) would be to refactor an abstract base class out of FieldFacetStats defining the API contract and then have two concrete impls: one using DocTermsIndex provided in teh constructor, and one using an UnInvertedField provided in it's constructor.

          I'm also a little perplexed by the "UnInvertedField.getStats" method which is modified in this patch, but seemingly only as an aside so it will still compile because of the new param added to the FieldFacetStats constructor. At a glance, I'm not clear on wether this method is even used anywhere, but that makes it seem all the more suspicious: it appears that a lot of the logic in getTermNumbers was cut/paste from this method – should these methods be refactored to eliminate that duplication? can UIF.getStats just be deprecated & removed? Would be make more sense instead of adding some of these new methods to just "fix" UIF.getStats so that it recognizes when another UIF instance is needed for the facet field(s)?

          Lastly: I noticed that while the patch did include tests, it didn't include the original tests I wrote when this issue was filled (see SOLR-1782.test.patch). When I attempted to merge those tests in, I got a failure in testMicroStatsWithMultiValuedFacetField (even after i fixed the obvious case sensitivity problem in the original test) – hopefully this is just a demonstration of the problem i mentioned above about mixing ords from the DocTermsIndex and the UnInvertedField, but it may be a symptom of a diff problem.

          unified patch will all tests attached

          Show
          Hoss Man added a comment - Wojtek & David... I spent some time today trying to look over the latest patch ( SOLR-1782 .2013-01-07.patch ) and had some trouble wrapping my head arround it. In particular i think the crux of my confusion stems from this comment from Wojtek... New patch uses both UninvertedField and FieldCache.DocTermsIndex for multi-valued facet fields ...which doens't really make sense to me. If a field is multivalued, then (unless i'm missunderstanding something) a DocTermsIndex won't make sense for that field at all. At best it's wasted RAM to build up the DocTermsIndex when UnInvertedField is going to be used instead – but in this patch, as part of the "if (null == uif)" sprinkled throughout FieldFacetStats, it looks like term ordinals from the UnInvertedField.getTermNumbers method are then being used to look Term values from the DocTermsIndex ... i don't understand that at all. Hopefully this is just a minor bug in the patch, and the call to "si.lookup" in the UIF block(s) of FieldFacetStats.facet & FieldFacetStats.facetTermNum are ment to be calls to some similar method in UnivertedField, but it's not entirely clear. If that is the case, then it suggests to me that a better way to organize this code so that it's more clear what's going on (and to save all that wasted RAM used by the DocTermsIndex when going down the UIF code path!) would be to refactor an abstract base class out of FieldFacetStats defining the API contract and then have two concrete impls: one using DocTermsIndex provided in teh constructor, and one using an UnInvertedField provided in it's constructor. I'm also a little perplexed by the "UnInvertedField.getStats" method which is modified in this patch, but seemingly only as an aside so it will still compile because of the new param added to the FieldFacetStats constructor. At a glance, I'm not clear on wether this method is even used anywhere, but that makes it seem all the more suspicious: it appears that a lot of the logic in getTermNumbers was cut/paste from this method – should these methods be refactored to eliminate that duplication? can UIF.getStats just be deprecated & removed? Would be make more sense instead of adding some of these new methods to just "fix" UIF.getStats so that it recognizes when another UIF instance is needed for the facet field(s)? Lastly: I noticed that while the patch did include tests, it didn't include the original tests I wrote when this issue was filled (see SOLR-1782 .test.patch). When I attempted to merge those tests in, I got a failure in testMicroStatsWithMultiValuedFacetField (even after i fixed the obvious case sensitivity problem in the original test) – hopefully this is just a demonstration of the problem i mentioned above about mixing ords from the DocTermsIndex and the UnInvertedField, but it may be a symptom of a diff problem. unified patch will all tests attached
          Hide
          David Christianson added a comment -

          Thanks for looking at this! I'll look at the tests in the unified patch.

          Two questions, then my understanding of what's going on in the patch.

          1. Pure ignorance of the codebase prevents me from understanding perf issues with these objects - just that UnInvertedField got used for statistics collection with multivalued facets - how is it possible to get the term values without the DocTermsIndex?
          2. Should DocTermsIndex be used at all?

          My understanding:

          I can see there is definitely a code dup issue (or at least a lot of branching logic) not only in the patch but in the current code.

          1. the original StatsComponent used a DocTermsIndex to look up facet values to be accumulated using DTI.getOrd().
          2. In the patch UnInvertedField is used in FieldFacetStats to deal with multivalued facet fields, which mainly included modifying UnInvertedField to provide the getTermNumbers method that listed term ordinals. As an aside, I don't understand whether this is reasonable to do or not, why there wouldn't already be a way to get at all the values of a field.
          3. Within FieldFacetStats the if (null == uif) is added. The code on both branches is more or less duplicate, only the multivalued branch (null != uif) iterates over multiple ordinals using the UnInverted Field whereas the single valued branch only reads one ordingal. But both branches use DocTermsIndex.
          4. UIF.getStats is used in getStatsInfo in the current code base. It looks to me it is specifically used when the value being computed is multivalued and in the patch gets modified to handle the case of a multivalued facet field accumulating stats on a multivalued value field.
          5. There is another facet accumulation method in FieldFacetStats used in the current code base when accumulating stats over multivalued value fields. This winds up getting the if (null == uif) treatment as well. So code duplication is multiplied
          6. Even though there is this code dup in the current code base wrt UIF.getStats it still seems like FieldFacetStats is the core - maybe there's a way to fix it all?

          Show
          David Christianson added a comment - Thanks for looking at this! I'll look at the tests in the unified patch. Two questions, then my understanding of what's going on in the patch. 1. Pure ignorance of the codebase prevents me from understanding perf issues with these objects - just that UnInvertedField got used for statistics collection with multivalued facets - how is it possible to get the term values without the DocTermsIndex? 2. Should DocTermsIndex be used at all? My understanding: I can see there is definitely a code dup issue (or at least a lot of branching logic) not only in the patch but in the current code. 1. the original StatsComponent used a DocTermsIndex to look up facet values to be accumulated using DTI.getOrd(). 2. In the patch UnInvertedField is used in FieldFacetStats to deal with multivalued facet fields, which mainly included modifying UnInvertedField to provide the getTermNumbers method that listed term ordinals. As an aside, I don't understand whether this is reasonable to do or not, why there wouldn't already be a way to get at all the values of a field. 3. Within FieldFacetStats the if (null == uif) is added. The code on both branches is more or less duplicate, only the multivalued branch (null != uif) iterates over multiple ordinals using the UnInverted Field whereas the single valued branch only reads one ordingal. But both branches use DocTermsIndex. 4. UIF.getStats is used in getStatsInfo in the current code base. It looks to me it is specifically used when the value being computed is multivalued and in the patch gets modified to handle the case of a multivalued facet field accumulating stats on a multivalued value field. 5. There is another facet accumulation method in FieldFacetStats used in the current code base when accumulating stats over multivalued value fields. This winds up getting the if (null == uif) treatment as well. So code duplication is multiplied 6. Even though there is this code dup in the current code base wrt UIF.getStats it still seems like FieldFacetStats is the core - maybe there's a way to fix it all?
          Hide
          Steven Bower added a comment -

          Attached is a patch that I believe fixes the issues found in the latest unified patch from (2013.01.28)... The previous patch was interchangably using term nums from an UninvertedField with a FieldCache... this failed in cases where there were multiple values in the facet field.

          I updated to use uif.lookupTerm() and a TermsEnum generated from uif.getOrdTermsEnum()..

          All the tests now pass, but I am uncertain of the performance/memory footprint of the GetOrdTermsEnum() and/or whether it would be possible to hold onto this enum for longer than I am now.

          Show
          Steven Bower added a comment - Attached is a patch that I believe fixes the issues found in the latest unified patch from (2013.01.28)... The previous patch was interchangably using term nums from an UninvertedField with a FieldCache... this failed in cases where there were multiple values in the facet field. I updated to use uif.lookupTerm() and a TermsEnum generated from uif.getOrdTermsEnum().. All the tests now pass, but I am uncertain of the performance/memory footprint of the GetOrdTermsEnum() and/or whether it would be possible to hold onto this enum for longer than I am now.
          Hide
          Elran Dvir added a comment -

          Hi,

          I tried to apply the latest patch to 4.2.1 and it doesn't seem to be compatible.
          Do you plan a patch compatible to this version?

          Thanks.

          Show
          Elran Dvir added a comment - Hi, I tried to apply the latest patch to 4.2.1 and it doesn't seem to be compatible. Do you plan a patch compatible to this version? Thanks.
          Hide
          Nelson Gonzalez Gonzalez added a comment -

          Hi,

          Is there any patch to apply on solr version 4.4 or 4.5 to fix this issue?

          Show
          Nelson Gonzalez Gonzalez added a comment - Hi, Is there any patch to apply on solr version 4.4 or 4.5 to fix this issue?
          Hide
          Steven Bower added a comment -

          As much of this functionality is replaced by the Analytics component I don't intent to work much more on this patch.

          Show
          Steven Bower added a comment - As much of this functionality is replaced by the Analytics component I don't intent to work much more on this patch.
          Hide
          Patanachai Tangchaisin added a comment -

          Here is patch for Solr 4.2.1. I copied a test cases from latest previous patch (SOLR-1782.2013-04-10.patch).
          Instead of modify UnInvertedField.java (it is too complex to modify), I used DocValues to get a value from a field to aggregate stats.

          We've used it in our production for some time now since we stuck on version 4.2.1 due to some of our requirement. However, our fields that need stats.facet is DocValues field. So, I hope the previous patch test case cover a case where multi-value field is not a DocValue.

          Show
          Patanachai Tangchaisin added a comment - Here is patch for Solr 4.2.1. I copied a test cases from latest previous patch ( SOLR-1782 .2013-04-10.patch). Instead of modify UnInvertedField.java (it is too complex to modify), I used DocValues to get a value from a field to aggregate stats. We've used it in our production for some time now since we stuck on version 4.2.1 due to some of our requirement. However, our fields that need stats.facet is DocValues field. So, I hope the previous patch test case cover a case where multi-value field is not a DocValue.

            People

            • Assignee:
              Hoss Man
              Reporter:
              Gerald DeConto
            • Votes:
              16 Vote for this issue
              Watchers:
              19 Start watching this issue

              Dates

              • Created:
                Updated:

                Development