Solr / SOLR-8893

Wrong TermVector docfreq calculation with enabled ExactStatsCache


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 5.5
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      Hi,

      we are currently seeing that some values calculated by the TermVectorComponent (TV) are clearly wrong when
      ExactStatsCache is enabled. The affected part is the shard-wide TV docfreq calculation.

      This problem is a follow-up to
      SOLR-8459 (NPE using TermVectorComponent in combination with ExactStatsCache).

      Maybe the problem is trivial and we have simply misconfigured something ...

      So let's go deeper into the problem:

      1) The problem in summary:
      ==================
      We are sending requests with "tv.df", "tv.tf" and "tv.tf_idf" enabled:

      tv.df=true&tv.tf_idf=true&tv.tf=true
      

      Additionally, for debugging purposes, we request the following function values:

      termfreq("site_term_maincontent","abakus"),docfreq("site_maincontent_term_wdf","abakus"),ttf("site_maincontent_term_wdf","abakus")
      

      Our findings are:

      • tv.tf and termfreq both seem to be correct
      • tv.df and docfreq are both clearly wrong
      • tv.tf_idf and ttf are wrong as well, I guess as a consequence of the wrong tv.df (docfreq)
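
      To illustrate why a wrong tv.df would carry over into tv.tf_idf, here is a minimal sketch. It assumes that tv.tf_idf is derived as tf divided by df, which is my understanding of TermVectorComponent and is not confirmed here. The tf value and the wrong df value are hypothetical; the df of 6 is the number of matching documents mentioned in the examples below. The sketch says nothing about ttf.

```python
def tv_tf_idf(tf: int, df: int) -> float:
    """Return tf / df, the ASSUMED formula behind tv.tf_idf."""
    if df <= 0:
        raise ValueError("df must be a positive document count")
    return tf / df


tf = 3          # hypothetical term frequency of "abakus" in one document
correct_df = 6  # six documents contain "abakus" according to this report
wrong_df = 2    # hypothetical wrong value, for illustration only

print(tv_tf_idf(tf, correct_df))  # 0.5
print(tv_tf_idf(tf, wrong_df))    # 1.5
```

      Under that assumption, any error in df changes tf_idf even when tf itself is correct.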

      2) What we have:
      ===========
      schema.xml:

      ...
              <field name="site_maincontent_term_wdf" type="text_token_wdf" indexed="true" stored="true" termVectors="true"
                     termPositions="true" termOffsets="true"/>
      ...
              <fieldType name="text_token_wdf" class="solr.TextField" positionIncrementGap="100">
                  <analyzer>
                      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                      <filter class="solr.LowerCaseFilterFactory"/>
                      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
                  </analyzer>
              </fieldType>
      ...
      

      solrconfig.xml:

      ...
          <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
      ...
          <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>
          <requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
              <lst name="defaults">
                  <bool name="tv">true</bool>
              </lst>
              <arr name="last-components">
                  <str>tvComponent</str>
              </arr>
          </requestHandler>
      ...
      

      You can find all the details here:
      http://149.202.5.192:8820/solr/#/SingleDomainSite_34_shard1_replica1

      3) Examples
      ========

      If you call this link, you can see that there are 6 documents containing the word "abakus" in the field "site_maincontent_term_wdf" ...

      http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?q=site_maincontent_term_wdf%3Aabakus+AND+site_headercode%3A200&shards.qt=%2Ftvrh&tv.fl=site_maincontent_term_wdf&tv.df=true&tv.tf_idf=true&tv.tf=true&fl=site_url_id,site_url,termfreq%28%22site_term_maincontent%22,%22abakus%22%29,docfreq%28%22site_maincontent_term_wdf%22,%22abakus%22%29,ttf%28%22site_maincontent_term_wdf%22,%22abakus%22%29
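
      For anyone who wants to rebuild such a request for another term or field, the following is a rough sketch. The helper name build_tvrh_url is hypothetical. For simplicity it uses the same field for all three function queries, whereas the link above uses "site_term_maincontent" for termfreq. The percent-encoding produced by urlencode may differ cosmetically from the link above.

```python
from urllib.parse import urlencode

# Base URL taken from the link above; adjust it for your own core.
BASE = "http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh"


def build_tvrh_url(field: str, term: str, base: str = BASE) -> str:
    """Build a /tvrh request with tv.df, tv.tf and tv.tf_idf enabled."""
    # Function queries requested in fl for debugging purposes.
    functions = [
        f'termfreq("{field}","{term}")',
        f'docfreq("{field}","{term}")',
        f'ttf("{field}","{term}")',
    ]
    params = [
        ("q", f"{field}:{term} AND site_headercode:200"),
        ("shards.qt", "/tvrh"),
        ("tv.fl", field),
        ("tv.df", "true"),
        ("tv.tf_idf", "true"),
        ("tv.tf", "true"),
        ("fl", ",".join(["site_url_id", "site_url"] + functions)),
    ]
    return base + "?" + urlencode(params)


url = build_tvrh_url("site_maincontent_term_wdf", "abakus")
print(url)
```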

      But if you look at the "docfreq" field in the returned documents, the value is incorrect and differs from document to document (it should always be the same ...).

      "docfreq(field,term) returns the number of documents that contain the term in the field. This is a constant (the same value for all documents in the index)."
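
      The expectation quoted above can be checked mechanically. The sketch below assumes the response has already been parsed into a list of document dicts and that the function value is returned under the key shown; the sample documents and their docfreq values are hypothetical and only mimic the symptom described here.

```python
from typing import Any, Dict, List, Set

# Assumed key under which the function value appears in each document.
DOCFREQ_KEY = 'docfreq("site_maincontent_term_wdf","abakus")'


def distinct_values(docs: List[Dict[str, Any]], key: str) -> Set[Any]:
    """Collect the distinct values stored under `key` across all documents."""
    return {doc[key] for doc in docs if key in doc}


def is_constant(docs: List[Dict[str, Any]], key: str) -> bool:
    """True if every document that has `key` reports the same value."""
    return len(distinct_values(docs, key)) <= 1


# Hypothetical documents mimicking the reported symptom.
docs = [
    {"site_url_id": 1, DOCFREQ_KEY: 3},
    {"site_url_id": 2, DOCFREQ_KEY: 1},
    {"site_url_id": 3, DOCFREQ_KEY: 2},
]

print(is_constant(docs, DOCFREQ_KEY))  # False
```

      A correct docfreq would make is_constant return True for any result page of the same query.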

      Here is a link with shards.info enabled:
      http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?&wt=xml&q=site_maincontent_term_wdf%3Aabakus&start=0&rows=10&fl=ttf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cdocfreq%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cidf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Csite_url&shards.qt=/tvrh&shards.info=true

      Here is a link with debug enabled:
      http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?omitHeader=true&shards.qt=%2Ftvrh&wt=xml&json.nl=flat&q=site_maincontent_term_wdf%3Aabakus&start=0&rows=1000&fl=ttf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cdocfreq%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cidf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Csite_url&debugQuery=true

            People

              Assignee: Unassigned
              Reporter: Andreas Daffner (adaffner)