SOLR-3642

Count is inconsistent between facet and stats

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-BETA, 6.0
    • Labels:
      None
    • Environment:

      4.0 alpha on macos 10.6

      Description

      Steps to reproduce:

      1) Download apache-solr-4.0.0-ALPHA
      2) cd example; java -jar start.jar
      3) cd exampledocs; ./post.sh *.xml
      4) Use StatsComponent to get the stats for field 'popularity', faceted on 'cat'. The 'count' for 'electronics' is 3:
      http://localhost:8983/solr/collection1/select?q=cat:electronics&wt=json&rows=0&stats=true&stats.field=popularity&stats.facet=cat

      {
        "stats_fields": {
          "popularity": {
            "min": 0,
            "max": 10,
            "count": 14,
            "missing": 0,
            "sum": 75,
            "sumOfSquares": 503,
            "mean": 5.357142857142857,
            "stddev": 2.7902892835178013,
            "facets": {
              "cat": {
                "music": { "min": 10, "max": 10, "count": 1, "missing": 0, "sum": 10, "sumOfSquares": 100, "mean": 10, "stddev": 0 },
                "monitor": { "min": 6, "max": 6, "count": 2, "missing": 0, "sum": 12, "sumOfSquares": 72, "mean": 6, "stddev": 0 },
                "hard drive": { "min": 6, "max": 6, "count": 2, "missing": 0, "sum": 12, "sumOfSquares": 72, "mean": 6, "stddev": 0 },
                "scanner": { "min": 6, "max": 6, "count": 1, "missing": 0, "sum": 6, "sumOfSquares": 36, "mean": 6, "stddev": 0 },
                "memory": { "min": 0, "max": 7, "count": 3, "missing": 0, "sum": 12, "sumOfSquares": 74, "mean": 4, "stddev": 3.605551275463989 },
                "graphics card": { "min": 7, "max": 7, "count": 2, "missing": 0, "sum": 14, "sumOfSquares": 98, "mean": 7, "stddev": 0 },
                "electronics": { "min": 1, "max": 7, "count": 3, "missing": 0, "sum": 9, "sumOfSquares": 51, "mean": 3, "stddev": 3.4641016151377544 }
              }
            }
          }
        }
      }
      5) Facet on 'cat'; the count for 'electronics' is 14:
      http://localhost:8983/solr/collection1/select?q=cat:electronics&wt=json&rows=0&facet=true&facet.field=cat

      { "cat": [ "electronics", 14, "memory", 3, "connector", 2, "graphics card", 2, "hard drive", 2, "monitor", 2, "camera", 1, "copier", 1, "multifunction printer", 1, "music", 1, "printer", 1, "scanner", 1, "currency", 0, "search", 0, "software", 0 ] }

      So StatsComponent reports a count of 3 for the 'electronics' cat, while FacetComponent reports 14. Is this a bug?

      Following is the field definition for 'cat'.
      <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>

          Activity

          Hoss Man added a comment -

          I believe the root problem here is that "stats.facet" is relatively naive and doesn't work with multivalued fields (and "cat" is multivalued)
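
A rough illustration of how that can produce the numbers in the report: the per-category counts in the stats output (1 + 2 + 2 + 1 + 3 + 2 + 3) add up to 14, the overall count, which is what you would see if every document were counted under exactly one 'cat' value. The plain-Java sketch below contrasts that with counting every value. It is not Solr code; the class and method names are hypothetical, and the choice of "last value wins" is an assumption made only so the example is deterministic.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration; none of these names are Solr APIs. */
final class MultiValuedCountDemo {

    /** Counts each document once, under a single value (assumed here: its last one). */
    static Map<String, Integer> countOneValuePerDoc(List<List<String>> docs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> values : docs) {
            if (values.isEmpty()) {
                continue; // a document without the field contributes nothing
            }
            counts.merge(values.get(values.size() - 1), 1, Integer::sum);
        }
        return counts;
    }

    /** Counts each document under every value it carries. */
    static Map<String, Integer> countAllValues(List<List<String>> docs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> values : docs) {
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three made-up documents, all tagged "electronics".
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("electronics", "memory"),
                Arrays.asList("electronics", "monitor"),
                Arrays.asList("electronics"));

        System.out.println(countOneValuePerDoc(docs)); // {memory=1, monitor=1, electronics=1}
        System.out.println(countAllValues(docs));      // {electronics=3, memory=1, monitor=1}
    }
}
```

In the one-value-per-document variant the bucket counts always sum to the number of documents, so a value shared by every document ("electronics" here) is undercounted.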

          Yandong Yao added a comment (edited) -

          You are right. The relevant code is below:

          SchemaField fsf = searcher.getSchema().getField(facetField);
          FieldType facetFieldType = fsf.getType();

          if (facetFieldType.isTokenized() || facetFieldType.isMultiValued()) {
            throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                "Stats can only facet on single-valued fields, not: " + facetField + "[" + facetFieldType + "]");
          }

          try {
            facetTermsIndex = FieldCache.DEFAULT.getTermsIndex(searcher.getAtomicReader(), facetField);
          }

          It looks like this condition is not sufficient to detect a multiValued field; it should be:

          if (fsf.multiValued() || facetFieldType.isTokenized() || facetFieldType.isMultiValued())
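
To see why the original condition lets 'cat' through, note that its definition sets multiValued="true" on the field, while the type is "string". The sketch below models that split with hypothetical stand-in classes (FieldTypeStub and SchemaFieldStub are not the real Solr FieldType/SchemaField), assuming the "string" type itself is untokenized and not flagged multivalued.

```java
/** Hypothetical stand-ins for illustration; not the real Solr classes. */
final class FacetFieldCheckDemo {

    /** Properties declared on the field type. */
    static final class FieldTypeStub {
        final boolean tokenized;
        final boolean multiValued;

        FieldTypeStub(boolean tokenized, boolean multiValued) {
            this.tokenized = tokenized;
            this.multiValued = multiValued;
        }
    }

    /** Properties declared on the field itself, plus a reference to its type. */
    static final class SchemaFieldStub {
        final String name;
        final FieldTypeStub type;
        final boolean multiValued;

        SchemaFieldStub(String name, FieldTypeStub type, boolean multiValued) {
            this.name = name;
            this.type = type;
            this.multiValued = multiValued;
        }
    }

    /** Mirrors the original condition: only the field type is inspected. */
    static boolean rejectedByOriginalCheck(SchemaFieldStub field) {
        return field.type.tokenized || field.type.multiValued;
    }

    /** Mirrors the proposed condition: the field itself is inspected as well. */
    static boolean rejectedByProposedCheck(SchemaFieldStub field) {
        return field.multiValued || field.type.tokenized || field.type.multiValued;
    }

    public static void main(String[] args) {
        // Models 'cat': multiValued is set on the field, not on its type (assumption).
        SchemaFieldStub cat = new SchemaFieldStub("cat", new FieldTypeStub(false, false), true);

        System.out.println(rejectedByOriginalCheck(cat)); // false: the request is processed
        System.out.println(rejectedByProposedCheck(cat)); // true: the request is refused
    }
}
```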

          Hoss Man added a comment -

          Nice catch!

          yeah, that entire error check is bogus – the properties of the field type don't matter at all, just the properties of the SchemaField (and tokenized isn't a valid check, because something could use "KeywordTokenizer" and would be valid to facet on)

          here's a patch with a test to ensure we fail instead of giving bogus results back (still running all tests to make sure i haven't broken something else)
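
A minimal sketch of the kind of check described above, where only the field's own multiValued property decides and tokenization is ignored. This is not the committed patch: FieldStub and requireSingleValued are hypothetical names, IllegalArgumentException stands in for SolrException, and "sku_kw" is a made-up field imagined as using a KeywordTokenizer-style analyzer.

```java
/** Hypothetical illustration of a field-level-only check; not the committed patch. */
final class StatsFacetValidationDemo {

    /** Stand-in for a schema field; 'tokenized' is carried only to show it is ignored. */
    static final class FieldStub {
        final String name;
        final boolean tokenized;
        final boolean multiValued;

        FieldStub(String name, boolean tokenized, boolean multiValued) {
            this.name = name;
            this.tokenized = tokenized;
            this.multiValued = multiValued;
        }
    }

    /** Fails fast for multivalued fields instead of returning misleading stats. */
    static void requireSingleValued(FieldStub field) {
        if (field.multiValued) {
            throw new IllegalArgumentException(
                    "Stats can only facet on single-valued fields, not: " + field.name);
        }
    }

    /** Test-style helper: reports whether the check refuses the field. */
    static boolean isRejected(FieldStub field) {
        try {
            requireSingleValued(field);
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        FieldStub cat = new FieldStub("cat", false, true);
        FieldStub keywordTokenized = new FieldStub("sku_kw", true, false);

        System.out.println(isRejected(cat));              // true
        System.out.println(isRejected(keywordTokenized)); // false
    }
}
```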

          Hoss Man added a comment -

          Committed revision 1363555. - trunk
          Committed revision 1363556. - 4x

          Thanks Yandong!

          Yandong Yao added a comment -

          Hi Hoss,

          Thanks for the quick commit. One further question: if I would like to implement stats with a facet field that is multivalued, could you please provide some guidance?

          Currently StatsComponent doesn't support a multivalued facet field because it uses FieldCache, which doesn't support multivalued fields. Are there any alternatives?

          If it is possible, I would like to create a JIRA issue for it and try to work on it.

          Thanks!

          Regards,
          Yandong
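
For orientation only, here is a sketch of what stats faceted on a multivalued field could mean: every document contributes its numeric value to the bucket of each facet value it carries. This is plain Java over in-memory data, not StatsComponent code and not the SOLR-1782 patch; Doc, Stats, and statsByFacet are hypothetical names, and the question of how to read multiple values per document from the index is deliberately left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory illustration; not how StatsComponent is implemented. */
final class MultiValuedStatsFacetDemo {

    /** A document with a multivalued facet field and one numeric stats field. */
    static final class Doc {
        final List<String> cats;
        final double popularity;

        Doc(List<String> cats, double popularity) {
            this.cats = cats;
            this.popularity = popularity;
        }
    }

    /** Running count, sum, min, and max for one facet bucket. */
    static final class Stats {
        long count;
        double sum;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;

        void add(double value) {
            count++;
            sum += value;
            min = Math.min(min, value);
            max = Math.max(max, value);
        }

        double mean() {
            return count == 0 ? Double.NaN : sum / count;
        }
    }

    /** Adds each document's value to the bucket of every facet value it has. */
    static Map<String, Stats> statsByFacet(List<Doc> docs) {
        Map<String, Stats> buckets = new LinkedHashMap<>();
        for (Doc doc : docs) {
            for (String cat : doc.cats) {
                buckets.computeIfAbsent(cat, key -> new Stats()).add(doc.popularity);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Made-up documents, not the example data from the report.
        List<Doc> docs = Arrays.asList(
                new Doc(Arrays.asList("electronics", "memory"), 5),
                new Doc(Arrays.asList("electronics"), 7),
                new Doc(Arrays.asList("memory"), 0));

        for (Map.Entry<String, Stats> entry : statsByFacet(docs).entrySet()) {
            Stats s = entry.getValue();
            System.out.println(entry.getKey() + ": count=" + s.count + " sum=" + s.sum
                    + " min=" + s.min + " max=" + s.max + " mean=" + s.mean());
        }
    }
}
```

With this definition the bucket counts no longer sum to the number of documents, which matches how FacetComponent counts a multivalued field.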

          Hoss Man added a comment (edited) -

          Yandong: the issue i linked this one to (SOLR-1782) is open precisely to try and address this problem – there is an (old) patch there that i honestly have not had time to look at, but you may want to take a look and see if it can be brought up to date and polished up to work and have good tests

          (IIRC: the reason i never really dug into it before was because the way StatsComponent deals with stats.facet in general struck me as being kind of kludgy and hard to understand, and i couldn't see a clean way to make it work well with both multivalued fields and arbitrary field types)

          (EDIT: i don't usually worry about typos, but i'm sorry for spelling your name wrong)

          Yandong Yao added a comment -

          Hi Hoss,

          Thanks a lot. I will look at the patch at SOLR-1782 and try to apply it to trunk.

          Regards,
          Yandong


            People

            • Assignee: Hoss Man
            • Reporter: Yandong Yao
            • Votes: 0
            • Watchers: 1
