SOLR-8496

Facet search count numbers are falsified by older document versions when multi-select is used

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.4
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels: None
    • Environment:
    • Flags: Important

      Description

      Our setup is based on multiple cores. In one core we have a multiValued field with integer values, plus some other unimportant fields. We're using multi-select faceting on this field.

      We're querying a test scenario with:

      http://localhost:8983/solr/core-name/select?q=dummyask: (true) AND manufacturer: false AND id: (15039 16882 10850 20781)&fq={!tag=professions}professions: (59)&fl=id&wt=json&indent=true&facet=true&facet.field={!ex=professions}professions
      
      • Query: (numDocs:48545, maxDoc:48545)
        <response>
        <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        </lst>
        <result name="response" numFound="4" start="0">
        <doc>
        <int name="id">10850</int>
        </doc>
        <doc>
        <int name="id">16882</int>
        </doc>
        <doc>
        <int name="id">15039</int>
        </doc>
        <doc>
        <int name="id">20781</int>
        </doc>
        </result>
        <lst name="facet_counts">
        <lst name="facet_queries"/>
        <lst name="facet_fields">
        <lst name="professions">
        <int name="59">4</int>
        </lst>
        </lst>
        <lst name="facet_dates"/>
        <lst name="facet_ranges"/>
        <lst name="facet_intervals"/>
        <lst name="facet_heatmaps"/>
        </lst>
        </response>
        
      • Then we update one document, changing some fields (numDocs:48545, maxDoc:48546). Note that maxDoc has increased by one:
        <response>
        <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        </lst>
        <result name="response" numFound="4" start="0">
        <doc>
        <int name="id">10850</int>
        </doc>
        <doc>
        <int name="id">16882</int>
        </doc>
        <doc>
        <int name="id">15039</int>
        </doc>
        <doc>
        <int name="id">20781</int>
        </doc>
        </result>
        <lst name="facet_counts">
        <lst name="facet_queries"/>
        <lst name="facet_fields">
        <lst name="professions">
        <int name="59">5</int>
        </lst>
        </lst>
        <lst name="facet_dates"/>
        <lst name="facet_ranges"/>
        <lst name="facet_intervals"/>
        <lst name="facet_heatmaps"/>
        </lst>
        </response>
        

      The Problem:
      In the first query, we're getting a facet count of 4, which is correct. After updating one document, we're getting 5 as a result, which is not correct.

      Attachments:
      1. SOLR-8496.patch (6 kB), Yonik Seeley

        Issue Links

          Activity

          Shawn Heisey added a comment -

          This should really be handled on the mailing list. I see that you asked on IRC, but you had already been gone from the IRC channel for 45 minutes before I got reconnected to my IRC session (at 8:45 AM in my timezone).

          What exactly does "based on multiple cores" mean?

          Shawn Heisey added a comment -

          SOLR-8540 is the same problem. They said they tried docValues and it did not fix the problem.

          Related to your attempts to replicate the problem with simple tests ... if all of the documents in a segment are deleted, the entire segment will be deleted, and the problem will disappear. You'll need to run a test where you index several documents at once (so they end up in the same segment), then replace only some of those documents. The segment will continue to exist as long as at least one document is not replaced.
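
          The reproduction advice above can be sketched as a toy model (all names here are hypothetical; this is not the actual Lucene/Solr code): a segment's term statistics survive until every document in it is deleted, and a facet counter that skips the live-docs check also counts the deleted old versions.

          ```python
          # Toy model of segments with tombstone-style deletions. An "update"
          # marks the old version deleted and writes the new version to a
          # fresh segment, mirroring how Lucene handles document replacement.
          from dataclasses import dataclass, field

          @dataclass
          class Segment:
              # doc id -> list of facet terms for that doc
              docs: dict
              deleted: set = field(default_factory=set)

              def alive(self):
                  return [d for d in self.docs if d not in self.deleted]

          def facet_count(segments, term, respect_live_docs):
              count = 0
              for seg in segments:
                  if not seg.alive():
                      # A fully deleted segment is dropped entirely,
                      # so trivial tests that replace every doc miss the bug.
                      continue
                  doc_ids = seg.alive() if respect_live_docs else seg.docs
                  count += sum(1 for d in doc_ids if term in seg.docs[d])
              return count

          # Index four docs in one batch so they land in the same segment.
          seg = Segment(docs={10850: [59], 16882: [59], 15039: [59], 20781: [59]})
          index = [seg]
          assert facet_count(index, 59, respect_live_docs=True) == 4

          # "Update" one doc: old version is only marked deleted; the new
          # version goes into a new segment (numDocs stays, maxDoc grows).
          seg.deleted.add(16882)
          index.append(Segment(docs={16882: [59]}))

          assert facet_count(index, 59, respect_live_docs=True) == 4   # correct
          assert facet_count(index, 59, respect_live_docs=False) == 5  # the reported bug

          # If every doc in the segment is replaced, the segment disappears
          # and the inflated count goes away with it.
          seg.deleted.update(seg.docs)
          assert facet_count(index, 59, respect_live_docs=False) == 1
          ```

          This matches the reported numbers: 4 before the update, 5 after, and the symptom vanishes as soon as the old segment holds no live documents.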

          Shawn Heisey added a comment -

          One workaround is optimizing the index, but this is not a good workaround for most people, especially when changes are small and very frequent.

          Shawn Heisey added a comment -

          Summary of the fundamental symptoms for a committer or contributor who knows the facet code: in recent versions, facets are no longer excluding deleted documents from the counts. This appears to be the case whether or not the facets use docValues.

          Hoss Man added a comment -

          can we get some more details about your configs/schema? ... i'm trying to figure out enough details to be able to reproduce this.

          Using a trivial test with the techproducts example, i can't seem to reproduce...

          hossman@tray:~/lucene/5x_dev/solr$ bin/solr -e techproducts
          ...
          hossman@tray:~/lucene/5x_dev/solr$ curl 'http://localhost:8983/solr/techproducts/query?facet=true&facet.field=inStock&q=solr&omitHeader=true&rows=0'
          {
            "response":{"numFound":1,"start":0,"docs":[]
            },
            "facet_counts":{
              "facet_queries":{},
              "facet_fields":{
                "inStock":[
                  "true",1,
                  "false",0]},
              "facet_dates":{},
              "facet_ranges":{},
              "facet_intervals":{},
              "facet_heatmaps":{}}}
          ...
          hossman@tray:~/lucene/5x_dev/solr$ bin/post -c techproducts example/exampledocs/solr.xml 
          ...
          hossman@tray:~/lucene/5x_dev/solr$ curl 'http://localhost:8983/solr/techproducts/query?facet=true&facet.field=inStock&q=solr&omitHeader=true&rows=0'
          {
            "response":{"numFound":1,"start":0,"docs":[]
            },
            "facet_counts":{
              "facet_queries":{},
              "facet_fields":{
                "inStock":[
                  "true",1,
                  "false",0]},
              "facet_dates":{},
              "facet_ranges":{},
              "facet_intervals":{},
              "facet_heatmaps":{}}}
          hossman@tray:~/lucene/5x_dev/solr$ curl -sS 'http://localhost:8983/solr/techproducts/admin/luke?wt=json&indent=true' | egrep "maxDoc|numDoc"
              "numDocs":32,
              "maxDoc":33,
          
          Erick Erickson added a comment -

          I couldn't reproduce it with a JUnit test case either (non-Cloud, one core).

          Hoss Man added a comment -

          Also: does this reproduce for you when indexing from scratch, or is this an index you originally built with an older version of Solr and then upgraded to 5.4? (trying to figure out if there are older segments and maybe the bug is specific to 5.4 reading deleted docs from those older segments)

          can you also run CheckIndex (command line) and provide all of that output?

          Vasiliy Bout added a comment (edited) -

          You do not use multi-select faceting in your simple test. The multi-select local parameters are necessary to reproduce this issue. See SOLR-8540 for a complete description of when this issue occurs.
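
          For reference, this is the multi-select pattern in question, the same `tag`/`ex` local-parameter pair the reporter's query uses: the filter query is tagged, and the facet excludes that tag so the facet counts are computed as if the filter were not applied.

          ```
          fq={!tag=professions}professions:(59)
          facet=true
          facet.field={!ex=professions}professions
          ```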

          Vasiliy Bout added a comment -

          When I tried to reproduce this issue on a new empty test core, I did the following:

          1. Fill the core with a number of documents
          2. Overwrite some documents in the core, i.e. add new documents with the same ids as documents added before.

          After that, you can see in the "Schema Browser" that when you select the field and press "Load Term Info", the counts for the field are incorrect (they also include the old versions of the overwritten documents). Normal faceting gives correct results, but multi-select faceting gives incorrect results (matching the counts you saw in "Load Term Info").

          Andreas Müller added a comment -

          We built a complete new index from scratch. There are 48545 docs in the index. The effect only occurred once there were 10k docs in the index. Below are our Solr configuration, our schema, and the output of CheckIndex:

          solrconfig.xml
          <config>
            <luceneMatchVersion>4.5</luceneMatchVersion>
            <!--  The DirectoryFactory to use for indexes.
                  solr.StandardDirectoryFactory, the default, is filesystem based.
                  solr.RAMDirectoryFactory is memory based, not persistent, and doesn't work with replication. -->
            <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
          
            <updateHandler class="solr.DirectUpdateHandler2">
             <autoSoftCommit>
                  <maxTime>1000</maxTime>
              </autoSoftCommit>
              <autoCommit>
                  <maxTime>60000</maxTime> 
                  <openSearcher>false</openSearcher>
              </autoCommit>
            </updateHandler>
          
          
            <requestDispatcher handleSelect="true" >
              <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
            </requestDispatcher>
            
            <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
            <requestHandler name="/update" class="solr.UpdateRequestHandler" />
            <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
                
            <!-- config for the admin interface --> 
            <admin>
              <defaultQuery>solr</defaultQuery>
            </admin>
          
          </config>
          
          schema.xml
          <schema name="company comptest3" version="1.1">
          
              <types>
                  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
          
                  <!-- boolean type: "true" or "false" -->
                  <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
          
                  <!-- Default numeric field types. For faster range queries, consider the tint/tfloat/tlong/tdouble types. -->
                  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
                  <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
                  <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
          
                  <!-- lat long fields -->
                  <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
          
                  <!-- A Trie based date field for faster date range queries and date faceting. -->
                  <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
          
                  <!-- A text field that only splits on whitespace for exact matching of words -->
                  <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
                      <analyzer>
                          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                      </analyzer>
                  </fieldType>
          
                  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
                      <analyzer type="index">
                          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                          <filter class="solr.LowerCaseFilterFactory"/>
                      </analyzer>
                      <analyzer type="query">
                          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                          <filter class="solr.LowerCaseFilterFactory"/>
                      </analyzer>
                  </fieldType>
          
                  <fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
          
                      <analyzer type="index">
                          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
                          <filter class="solr.LowerCaseFilterFactory"/>
                          <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
                      </analyzer>
                      <analyzer type="query">
                          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
                          <filter class="solr.LowerCaseFilterFactory"/>
                      </analyzer>
          
                  </fieldType>
          
                  <fieldtype name="phonetic" stored="true" indexed="true" class="solr.TextField" >
                      <analyzer>
                          <tokenizer class="solr.StandardTokenizerFactory"/>
                          <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
                      </analyzer>
                  </fieldtype>
          
                  <!-- lowercases the entire field value, keeping it as a single token.   -->
                  <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
                      <analyzer>
                          <tokenizer class="solr.KeywordTokenizerFactory"/>
                          <filter class="solr.LowerCaseFilterFactory" />
                      </analyzer>
                  </fieldType>
          
                  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
          
              </types>
          
              <fields>
                  <!-- general -->
                  <field name="id"                    type="int"           indexed="true" stored="true" multiValued="false" required="true"/>
                  <field name="dummyask"              type="boolean"       indexed="true" stored="true" multiValued="false" />
                  <field name="disabled"              type="boolean"       indexed="true" stored="true" multiValued="false" />
                  <field name="closed"                type="boolean"       indexed="true" stored="true" multiValued="false" />
                  <field name="show"                  type="boolean"       indexed="true" stored="true" multiValued="false" />
                  <field name="pagecalls"             type="int"           indexed="true" stored="true" multiValued="false" />
                  <field name="publicated"            type="tdate"         indexed="true" stored="true" multiValued="false" />
          
                  <field name="name"                  type="text_rev"      indexed="true" stored="true" multiValued="false" />
                  <field name="name_filtered"         type="lowercase"     indexed="true" stored="true" multiValued="false" />
                  <field name="name_phonetic"         type="lowercase"     indexed="true" stored="true" multiValued="false" />
                  <field name="manufacturer"          type="boolean"       indexed="true" stored="true" multiValued="false" />
                  <field name="fulltext"              type="text_rev"      indexed="true" stored="true" multiValued="false" />
                  <field name="owner"                 type="text_rev"      indexed="true" stored="true" multiValued="true" />
                  <field name="member"                type="boolean"       indexed="true" stored="true" multiValued="false" />
                  <field name="professions"           type="long"          indexed="true" stored="true" multiValued="true" />
                  <field name="founding"              type="tdate"         indexed="true" stored="true" multiValued="false" />
                  <field name="employee_number"       type="int"           indexed="true" stored="true" multiValued="false" />
                  <field name="jobs"                  type="boolean"       indexed="true" stored="true" multiValued="false" />
                  <field name="image"                 type="text"          indexed="false" stored="true" multiValued="false" />
          
                  <!-- geografic options -->
                  <field name="ort"                   type="lowercase"     indexed="true" stored="true" multiValued="true" />
                  <field name="plz"                   type="lowercase"     indexed="true" stored="true" multiValued="true" />
                  <field name="land"                  type="lowercase"     indexed="true" stored="true" multiValued="true" />
                  <field name="bundesland"            type="lowercase"     indexed="true" stored="true" multiValued="true" />
                  <field name="lat"                   type="double"        indexed="true" stored="true" multiValued="true" />
                  <field name="lon"                   type="double"        indexed="true" stored="true" multiValued="true" />
                  <field name="geo"                   type="location"      indexed="true" stored="true" multiValued="false" />
                  <field name="geo_0_coordinate"      type="double"        indexed="true" stored="true" multiValued="false" />
                  <field name="geo_1_coordinate"      type="double"        indexed="true" stored="true" multiValued="false" />
          
                  <!-- display fields -->
                  <field name="profession_display"    type="text"          indexed="false" stored="true" multiValued="true" />
                  <field name="address_display"       type="text_rev"      indexed="true"  stored="true" multiValued="true" />
          
                  <!-- realized projects -->
                  <field name="done_projects"         type="lowercase"     indexed="true"  stored="true" multiValued="true"/>
          
                  <!-- projects in planing / projects in construction -->
                  <field name="projects"              type="long"          indexed="true"  stored="true" multiValued="true"/>
          
                  <!-- references -->
                  <field name="references"            type="lowercase"     indexed="true" stored="true" multiValued="true"/>
          
                  <field name="reference_info"        type="text"          indexed="false" stored="true" multiValued="false"/>
                  <field name="relevance"             type="int"           indexed="true" stored="true" multiValued="false"/>
          
                  <field name="_version_"             type="long"          indexed="true" stored="true"/>
          
              </fields>
          
              <!-- field to use to determine and enforce document uniqueness. -->
              <uniqueKey>id</uniqueKey>
          
              <!-- field for the QueryParser to use when an explicit fieldname is absent -->
              <defaultSearchField>name</defaultSearchField>
          
              <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
              <solrQueryParser defaultOperator="OR"/>
          </schema>
          
          java -cp ../server/solr-webapp/webapp/WEB-INF/lib/lucene-core-5.4.0.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex ../server/solr/companies/data/index
          Opening index @ ../server/solr/companies/data/index
          
          Segments file=segments_4t numSegments=10 version=5.4.0 id=8b82erk4sdq7dvuluswzgthh5 format= userData={commitTimeMSec=1452862011769}
            1 of 10: name=_3fw maxDoc=44624
              version=5.4.0
              id=8b82erk4sdq7dvuluswzgth5m
              codec=Lucene54
              compound=false
              numFiles=11
              size (MB)=54.18
              diagnostics = {os=Linux, java.vendor=Oracle Corporation, java.version=1.8.0_66-internal, java.vm.version=25.66-b17, lucene.version=5.4.0, mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_66-internal-b17, source=merge, mergeFactor=10, os.version=3.11-2-amd64, timestamp=1452857890899}
              has deletions [delGen=5]
              test: open reader.........OK [took 1.244 sec]
              test: check integrity.....OK [took 0.148 sec]
              test: check live docs.....OK [500 deleted docs] [took 0.011 sec]
              test: field infos.........OK [35 fields] [took 0.001 sec]
              test: field norms.........OK [12 fields] [took 0.048 sec]
              test: terms, freq, prox...OK [1119766 terms; 5522980 terms/docs pairs; 6617236 tokens] [took 5.664 sec]
              test: stored fields.......OK [1786969 total field count; avg 40.5 fields per doc] [took 2.215 sec]
              test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.001 sec]
              test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.001 sec]
          
            2 of 10: name=_3pc maxDoc=1476
              version=5.4.0
              id=8b82erk4sdq7dvuluswzgthf8
              codec=Lucene54
              compound=true
              numFiles=3
              size (MB)=1.988
              diagnostics = {os=Linux, java.vendor=Oracle Corporation, java.version=1.8.0_66-internal, java.vm.version=25.66-b17, lucene.version=5.4.0, mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_66-internal-b17, source=merge, mergeFactor=10, os.version=3.11-2-amd64, timestamp=1452861829493}
              no deletions
              test: open reader.........OK [took 0.034 sec]
              test: check integrity.....OK [took 0.006 sec]
              test: check live docs.....OK [took 0.000 sec]
              test: field infos.........OK [35 fields] [took 0.000 sec]
              test: field norms.........OK [12 fields] [took 0.001 sec]
              test: terms, freq, prox...OK [67708 terms; 174468 terms/docs pairs; 204435 tokens] [took 0.938 sec]
              test: stored fields.......OK [59440 total field count; avg 40.3 fields per doc] [took 0.052 sec]
              test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
              test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.001 sec]
          
            3 of 10: name=_3pw maxDoc=1426
              version=5.4.0
              id=8b82erk4sdq7dvuluswzgthfs
              codec=Lucene54
              compound=true
              numFiles=3
              size (MB)=2.08
              diagnostics = {os=Linux, java.vendor=Oracle Corporation, java.version=1.8.0_66-internal, java.vm.version=25.66-b17, lucene.version=5.4.0, mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_66-internal-b17, source=merge, mergeFactor=10, os.version=3.11-2-amd64, timestamp=1452861864304}
              no deletions
              test: open reader.........OK [took 0.019 sec]
              test: check integrity.....OK [took 0.015 sec]
              test: check live docs.....OK [took 0.000 sec]
              test: field infos.........OK [35 fields] [took 0.000 sec]
              test: field norms.........OK [12 fields] [took 0.001 sec]
              test: terms, freq, prox...OK [67794 terms; 175792 terms/docs pairs; 216683 tokens] [took 0.836 sec]
              test: stored fields.......OK [62036 total field count; avg 43.5 fields per doc] [took 0.056 sec]
              test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
              test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
          
            4 of 10: name=_3pm maxDoc=1398
              version=5.4.0
              id=8b82erk4sdq7dvuluswzgthfi
              codec=Lucene54
              compound=true
              numFiles=3
              size (MB)=2.035
              diagnostics = {os=Linux, java.vendor=Oracle Corporation, java.version=1.8.0_66-internal, java.vm.version=25.66-b17, lucene.version=5.4.0, mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_66-internal-b17, source=merge, mergeFactor=10, os.version=3.11-2-amd64, timestamp=1452861844413}
              no deletions
              test: open reader.........OK [took 0.016 sec]
              test: check integrity.....OK [took 0.017 sec]
              test: check live docs.....OK [took 0.000 sec]
              test: field infos.........OK [35 fields] [took 0.000 sec]
              test: field norms.........OK [12 fields] [took 0.001 sec]
              test: terms, freq, prox...OK [67878 terms; 173372 terms/docs pairs; 213758 tokens] [took 0.162 sec]
              test: stored fields.......OK [59498 total field count; avg 42.6 fields per doc] [took 0.048 sec]
              test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
              test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
          
            5 of 10: name=_3r1 maxDoc=114
              version=5.4.0
              id=8b82erk4sdq7dvuluswzgthgy
              codec=Lucene54
              compound=true
              numFiles=3
              size (MB)=0.658
              diagnostics = {os=Linux, java.vendor=Oracle Corporation, java.version=1.8.0_66-internal, java.vm.version=25.66-b17, lucene.version=5.4.0, mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_66-internal-b17, source=merge, mergeFactor=10, os.version=3.11-2-amd64, timestamp=1452861925974}
              no deletions
              test: open reader.........OK [took 0.008 sec]
              test: check integrity.....OK [took 0.002 sec]
              test: check live docs.....OK [took 0.000 sec]
              test: field infos.........OK [35 fields] [took 0.000 sec]
              test: field norms.........OK [12 fields] [took 0.000 sec]
              test: terms, freq, prox...OK [18002 terms; 41857 terms/docs pairs; 64375 tokens] [took 0.061 sec]
              test: stored fields.......OK [14505 total field count; avg 127.2 fields per doc] [took 0.018 sec]
              test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
              test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
          
            6 of 10: name=_3r2 maxDoc=1
              version=5.4.0
              id=8b82erk4sdq7dvuluswzgthgz
              codec=Lucene54
              compound=false
              numFiles=10
              size (MB)=0.026
              diagnostics = {java.runtime.version=1.8.0_66-internal-b17, java.vendor=Oracle Corporation, java.version=1.8.0_66-internal, java.vm.version=25.66-b17, lucene.version=5.4.0, os=Linux, os.arch=amd64, os.version=3.11-2-amd64, source=flush, timestamp=1452861930569}
              no deletions
              test: open reader.........OK [took 0.017 sec]
              test: check integrity.....OK [took 0.007 sec]
              test: check live docs.....OK [took 0.000 sec]
              test: field infos.........OK [34 fields] [took 0.000 sec]
              test: field norms.........OK [12 fields] [took 0.000 sec]
              test: terms, freq, prox...OK [809 terms; 809 terms/docs pairs; 1374 tokens] [took 0.010 sec]
              test: stored fields.......OK [324 total field count; avg 324.0 fields per doc] [took 0.001 sec]
              test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
              test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.001 sec]
          
            7 of 10: name=_3r3 maxDoc=1
              version=5.4.0
              id=8b82erk4sdq7dvuluswzgthh0
              codec=Lucene54
              compound=false
              numFiles=10
              size (MB)=0.046
              diagnostics = {java.runtime.version=1.8.0_66-internal-b17, java.vendor=Oracle Corporation, java.version=1.8.0_66-internal, java.vm.version=25.66-b17, lucene.version=5.4.0, os=Linux, os.arch=amd64, os.version=3.11-2-amd64, source=flush, timestamp=1452861931845}
              no deletions
              test: open reader.........OK [took 0.022 sec]
              test: check integrity.....OK [took 0.000 sec]
              test: check live docs.....OK [took 0.000 sec]
              test: field infos.........OK [35 fields] [took 0.000 sec]
              test: field norms.........OK [12 fields] [took 0.000 sec]
              test: terms, freq, prox...OK [1611 terms; 1611 terms/docs pairs; 2890 tokens] [took 0.008 sec]
              test: stored fields.......OK [805 total field count; avg 805.0 fields per doc] [took 0.001 sec]
              test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
              test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
          
            8 of 10: name=_3r4 maxDoc=2
              version=5.4.0
              id=8b82erk4sdq7dvuluswzgthh1
              codec=Lucene54
              compound=false
              numFiles=10
              size (MB)=0.097
              diagnostics = {java.runtime.version=1.8.0_66-internal-b17, java.vendor=Oracle Corporation, java.version=1.8.0_66-internal, java.vm.version=25.66-b17, lucene.version=5.4.0, os=Linux, os.arch=amd64, os.version=3.11-2-amd64, source=flush, timestamp=1452861933112}
              no deletions
              test: open reader.........OK [took 0.024 sec]
              test: check integrity.....OK [took 0.001 sec]
              test: check live docs.....OK [took 0.000 sec]
              test: field infos.........OK [35 fields] [took 0.000 sec]
              test: field norms.........OK [12 fields] [took 0.000 sec]
              test: terms, freq, prox...OK [3333 terms; 3742 terms/docs pairs; 8204 tokens] [took 0.010 sec]
              test: stored fields.......OK [1176 total field count; avg 588.0 fields per doc] [took 0.005 sec]
              test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
              test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
          
            9 of 10: name=_3r5 maxDoc=2
              version=5.4.0
              id=8b82erk4sdq7dvuluswzgthh2
              codec=Lucene54
              compound=false
              numFiles=10
              size (MB)=0.07
              diagnostics = {java.runtime.version=1.8.0_66-internal-b17, java.vendor=Oracle Corporation, java.version=1.8.0_66-internal, java.vm.version=25.66-b17, lucene.version=5.4.0, os=Linux, os.arch=amd64, os.version=3.11-2-amd64, source=flush, timestamp=1452861935365}
              no deletions
              test: open reader.........OK [took 0.010 sec]
              test: check integrity.....OK [took 0.001 sec]
              test: check live docs.....OK [took 0.000 sec]
              test: field infos.........OK [35 fields] [took 0.000 sec]
              test: field norms.........OK [12 fields] [took 0.001 sec]
              test: terms, freq, prox...OK [2346 terms; 2583 terms/docs pairs; 4660 tokens] [took 0.010 sec]
              test: stored fields.......OK [1051 total field count; avg 525.5 fields per doc] [took 0.002 sec]
              test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
              test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
          
            10 of 10: name=_3r6 maxDoc=1
              version=5.4.0
              id=8b82erk4sdq7dvuluswzgthh4
              codec=Lucene54
              compound=false
              numFiles=10
              size (MB)=0.073
              diagnostics = {java.runtime.version=1.8.0_66-internal-b17, java.vendor=Oracle Corporation, java.version=1.8.0_66-internal, java.vm.version=25.66-b17, lucene.version=5.4.0, os=Linux, os.arch=amd64, os.version=3.11-2-amd64, source=flush, timestamp=1452861952782}
              no deletions
              test: open reader.........OK [took 0.008 sec]
              test: check integrity.....OK [took 0.001 sec]
              test: check live docs.....OK [took 0.000 sec]
              test: field infos.........OK [35 fields] [took 0.000 sec]
              test: field norms.........OK [12 fields] [took 0.000 sec]
              test: terms, freq, prox...OK [2581 terms; 2581 terms/docs pairs; 5101 tokens] [took 0.011 sec]
              test: stored fields.......OK [1241 total field count; avg 1241.0 fields per doc] [took 0.001 sec]
              test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.003 sec]
              test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
          
          No problems were detected with this index.
          
          Shawn Heisey added a comment -

          My apologies, Hoss Man. My summary of the issue was incomplete and did not mention multi-select, which it should have.

          Yonik Seeley added a comment -

          I can confirm this bug. Looking into it...

          Vasiliy Bout added a comment - - edited

          I put together a small example that reproduces this problem on a completely new core with a very simple schema and about 20 documents.

          First of all, I created a new core with the following schema.xml:

          <?xml version="1.0" ?>
          <schema name="basic" version="1.1">
              <types>
                  <fieldType name="string" class="solr.StrField" omitNorms="true" indexed="true" stored="true"/>
                  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0" indexed="true" stored="true"/>
              </types>
              <fields>
                  <field name="id" type="string" required="true"/>
                  <field name="foo_s" type="string"/>
                  <field name="bar_s" type="string" docValues="true"/>
                  <field name="foo_i" type="int"/>
                  <field name="bar_i" type="int" docValues="true"/>
              </fields>
              <uniqueKey>id</uniqueKey>
              <solrQueryParser defaultOperator="OR"/>
          </schema>
          

          After that, I generated a set of documents to fill the core with. I launched a Python interpreter in the terminal and typed the following one-liner:

          [ {"id":i,"foo_i":i,"bar_i":i,"foo_s":i,"bar_s":i} for i in range(1, 21) ]
          

          It gave me a set of 20 documents. This is the same set, slightly formatted to be human-readable:

          [
              {'bar_s': 1, 'foo_i': 1, 'bar_i': 1, 'foo_s': 1, 'id': 1},
              {'bar_s': 2, 'foo_i': 2, 'bar_i': 2, 'foo_s': 2, 'id': 2},
              {'bar_s': 3, 'foo_i': 3, 'bar_i': 3, 'foo_s': 3, 'id': 3},
              {'bar_s': 4, 'foo_i': 4, 'bar_i': 4, 'foo_s': 4, 'id': 4},
              {'bar_s': 5, 'foo_i': 5, 'bar_i': 5, 'foo_s': 5, 'id': 5},
              {'bar_s': 6, 'foo_i': 6, 'bar_i': 6, 'foo_s': 6, 'id': 6},
              {'bar_s': 7, 'foo_i': 7, 'bar_i': 7, 'foo_s': 7, 'id': 7},
              {'bar_s': 8, 'foo_i': 8, 'bar_i': 8, 'foo_s': 8, 'id': 8},
              {'bar_s': 9, 'foo_i': 9, 'bar_i': 9, 'foo_s': 9, 'id': 9},
              {'bar_s': 10, 'foo_i': 10, 'bar_i': 10, 'foo_s': 10, 'id': 10},
              {'bar_s': 11, 'foo_i': 11, 'bar_i': 11, 'foo_s': 11, 'id': 11},
              {'bar_s': 12, 'foo_i': 12, 'bar_i': 12, 'foo_s': 12, 'id': 12},
              {'bar_s': 13, 'foo_i': 13, 'bar_i': 13, 'foo_s': 13, 'id': 13},
              {'bar_s': 14, 'foo_i': 14, 'bar_i': 14, 'foo_s': 14, 'id': 14},
              {'bar_s': 15, 'foo_i': 15, 'bar_i': 15, 'foo_s': 15, 'id': 15},
              {'bar_s': 16, 'foo_i': 16, 'bar_i': 16, 'foo_s': 16, 'id': 16},
              {'bar_s': 17, 'foo_i': 17, 'bar_i': 17, 'foo_s': 17, 'id': 17},
              {'bar_s': 18, 'foo_i': 18, 'bar_i': 18, 'foo_s': 18, 'id': 18},
              {'bar_s': 19, 'foo_i': 19, 'bar_i': 19, 'foo_s': 19, 'id': 19},
              {'bar_s': 20, 'foo_i': 20, 'bar_i': 20, 'foo_s': 20, 'id': 20}
          ]
          

          After that, I opened the Solr Admin page in my browser, went to the "Documents" tab of my core, and filled it with the set of documents above. I selected the following parameters:

          • Request-Handler (qt): /update/json;
          • Document Type: Solr Command (raw XML or JSON);
          • Documents: set to the JSON generated in the Python interpreter above.

          After the Solr core was filled with documents, I added a single document once again, so that it overwrote the previous one:

          {'bar_s': 2, 'foo_i': 2, 'bar_i': 2, 'foo_s': 2, 'id': 2}
          
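The manual steps above can also be scripted; below is a minimal sketch that only builds the request bodies (the `make_docs` helper name, host, and core name are illustrative assumptions, not part of the original report):

```python
import json

def make_docs(n=20):
    # Same one-liner as above: every field of document i holds the value i,
    # so each facet value should occur exactly once in a healthy index.
    return [{"id": i, "foo_i": i, "bar_i": i, "foo_s": i, "bar_s": i}
            for i in range(1, n + 1)]

docs = make_docs()
# Request body for POST to /solr/<core>/update/json?commit=true
payload = json.dumps(docs)
# Re-sending the document with id=2 overwrites the old version,
# which is only marked deleted until a merge removes it.
overwrite_payload = json.dumps([docs[1]])
```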

          Now when I look at the "Overview" tab I see the following statistics:

          Last Modified: less than a minute ago
          Num Docs: 20
          Max Doc: 21
          Heap Memory Usage: -1
          Deleted Docs: 1
          Version: 7
          Segment Count: 2
          
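These numbers are internally consistent: numDocs counts live documents, while maxDoc also counts the not-yet-merged-away old version of document id=2:

```python
# maxDoc includes every document slot in the index, deleted or not;
# numDocs is what remains after subtracting the deleted slots.
max_doc = 21
deleted_docs = 1
num_docs = max_doc - deleted_docs
assert num_docs == 20  # matches "Num Docs: 20" in the Overview tab
```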

          At this stage, all multi-select facet queries give incorrect results. Since every document in the core has unique values for all fields, every facet query should report a count of 1 for every value of every field. Simple facet queries do return correct results:

          query is q=*:*&rows=0&facet=true&facet.limit=1&facet.field=foo_s&facet.field=foo_i&facet.field=bar_s&facet.field=bar_i
          response is

          {
            "responseHeader":{"status":0,"QTime":1},
            "response":{"numFound":20,"start":0,"docs":[]},
            "facet_counts":{
              "facet_queries":{},
              "facet_fields":{
                "foo_s":["1",1],
                "foo_i":["1",1],
                "bar_s":["1",1],
                "bar_i":["1",1]
              },
              "facet_dates":{},
              "facet_ranges":{},
              "facet_intervals":{},
              "facet_heatmaps":{}}}
          

          And this is what we get for a multi-select facet query:

          query is q=*:*&fq={!tag=a}id:*&rows=0&facet=true&facet.limit=1&facet.field={!ex=a}foo_s&facet.field={!ex=a}foo_i&facet.field={!ex=a}bar_s&facet.field={!ex=a}bar_i
          response is

          {
            "responseHeader":{"status":0,"QTime":2},
            "response":{"numFound":20,"start":0,"docs":[]},
            "facet_counts":{
              "facet_queries":{},
              "facet_fields":{
                "foo_s":["2",2],
                "foo_i":["2",2],
                "bar_s":["2",2],
                "bar_i":["2",2]},
              "facet_dates":{},
              "facet_ranges":{},
              "facet_intervals":{},
              "facet_heatmaps":{}}}
          

          So we get count 2 for value "2", i.e. the replaced (old) version of the document with id=2 is taken into account when using multi-select facets.
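          What happens under the hood can be illustrated with a toy model (plain Python; all names here are hypothetical, not Solr APIs): a facet counter that walks raw postings counts the deleted old copy of the overwritten document unless it screens candidates against a live-docs bitmap, which is exactly the difference between the two responses above.

```python
# Toy index: docs 0..19 hold values "1".."20"; doc 20 is the stale
# (overwritten) copy of the document with value "2", marked deleted.
postings = [(i, str(i + 1)) for i in range(20)]
postings.append((20, "2"))
live = [True] * 21
live[20] = False

def facet_counts(postings, live=None):
    """Count facet values; skip deleted docs only when a live bitmap is given."""
    counts = {}
    for doc, value in postings:
        if live is not None and not live[doc]:
            continue  # screen out the deleted (old) document version
        counts[value] = counts.get(value, 0) + 1
    return counts

buggy = facet_counts(postings)        # ignores deletions: "2" counted twice
fixed = facet_counts(postings, live)  # respects liveDocs: every value once
```

With the bitmap applied, every value counts 1; without it, value "2" counts 2, matching the observed responses.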

          Hoss Man added a comment -

          ...My summary of the issue was incomplete and did not mention multi-select, which it should have.

          Interesting – I noticed the tagged/excluded filters in the original example and definitely tried that when I was trying to reproduce, but I didn't see any change in the results, so I didn't include it in my "can't reproduce" example ... I must have either made a mistake somewhere, or tickled an excluded-filter code path that doesn't have this bug.

          Yonik Seeley added a comment -

          Patch attached, running complete tests now.

          Yonik Seeley added a comment -

          I noticed the tagged/excluded filters in the original example and definitely tried that when I was trying to reproduce

          Try uncached... i.e.

          q={!cache=false}*:*&...
          
          Adrien Grand added a comment -

          Should we only have the bugfix here that applies deleted docs in getDocSet and open another issue to discuss the hasDeletedDocs optimization?

          ASF subversion and git services added a comment -

          Commit 1725005 from Yonik Seeley in branch 'dev/trunk'
          [ https://svn.apache.org/r1725005 ]

          SOLR-8496: multi-select faceting and getDocSet(List<Query>) can match deleted docs

          Yonik Seeley added a comment -

          Crossed messages - I had finished testing and committed by the time I saw this.
          Anyway, I didn't see it as an optimization; I simply wrote it the way I would have originally, checking deleted docs in just the case where it was missing.

          ASF subversion and git services added a comment -

          Commit 1725008 from Yonik Seeley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1725008 ]

          SOLR-8496: multi-select faceting and getDocSet(List<Query>) can match deleted docs

          ASF subversion and git services added a comment -

          Commit 1725010 from Yonik Seeley in branch 'dev/branches/lucene_solr_5_4'
          [ https://svn.apache.org/r1725010 ]

          SOLR-8496: multi-select faceting and getDocSet(List<Query>) can match deleted docs

          Adrien Grand added a comment -

          I have concerns that this part of the code is playing with live docs a bit too much: there are several ways that live docs might get applied, which multiplies the risk of other livedocs-related bugs in the future. I would rather have had something that just applies live docs all the time in getDocSet, or that does if (answer == null) answer = getLiveDocs(); instead of the current pf.hasDeletedDocs = (answer == null).

          ASF subversion and git services added a comment -

          Commit 1725012 from Yonik Seeley in branch 'dev/branches/lucene_solr_5_3'
          [ https://svn.apache.org/r1725012 ]

          SOLR-8496: multi-select faceting and getDocSet(List<Query>) can match deleted docs

          Yonik Seeley added a comment -

          The original code was missing the case when liveDocs should be consulted, so I added a patch that consults liveDocs only in the case where it is needed (when all clauses are uncached).
          If we want to investigate further code cleanups, I think that can be done in another issue that won't hold up the releases.

          Joel Bernstein added a comment - - edited

          Yonik Seeley, can you provide a quick summary of the issue? I see lots of symptoms in the ticket, but I don't see the details of the inner workings of the bug.

          I'm concerned this bug may be hitting us in many different places besides facets, such as field collapsing and exporting.

          Joel Bernstein added a comment -

          One of the things mentioned in this ticket is that the doc counts in the schema browser were also affected. I'm wondering how far this bug reaches.

          Yonik Seeley added a comment - - edited

          I'm concerned this bug may be hitting us in many different places besides facets, such as field collapsing, and exporting.

          Indeed. We may still be vulnerable, but not due to this bug in particular.

          The change in general was LUCENE-6553, and that may yet cause bugs (like this one) in other areas.
          Deleted docs are now only screened out before hitting the Collector, so any place that does something lower-level, like Weight.scorer(), is vulnerable if used in a context that was expecting only live docs.

          This specific bug:
          The DocSet returned from SolrIndexSearcher.getDocSet(List<Query>) could contain deleted documents (and that breaks our current invariant that DocSets never contain deleted docs).
          LUCENE-6553 changed (among many others) this line:
          https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L2473
          We used to pass liveDocs at that point, but that method signature was removed.
          So now, if all clauses to be intersected are uncached, then Weight.scorer() is used for all of them, and the intersection can thus still contain deleted docs. If even one clause is a normal DocSet, we're good, since DocSets do reflect liveDocs.

          So the fix was: detect the case where all clauses are uncached (i.e. will use Weight.scorer()) and check liveDocs in that specific case.
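          That logic can be sketched in miniature (plain Python; intersect_clauses and its arguments are illustrative names, not the actual SolrIndexSearcher code):

```python
def intersect_clauses(clause_doc_ids, all_uncached, live_docs):
    """Intersect per-clause doc-id sets. When every clause came straight
    from Weight.scorer() (all_uncached), nothing has screened out deleted
    docs yet, so consult live_docs explicitly; a cached DocSet clause
    already reflects liveDocs and makes the extra check unnecessary."""
    result = set.intersection(*clause_doc_ids)
    if all_uncached:
        result = {d for d in result if live_docs[d]}
    return result
```

The real patch tracks the all-uncached condition with a flag (the pf.hasDeletedDocs mentioned in the discussion above) rather than a boolean parameter.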

          Yonik Seeley added a comment -

          I'm concerned this bug may be hitting us in many different places besides facets, such as field collapsing, and exporting.

          If you're wondering about post filters, I think it depends on how they get their docs.
          If it's through the normal mechanism (search(query, collector)), then we're good: Lucene filters out deleted docs before they hit the collector.
          If you ever feed a collector yourself, you need to ensure that it isn't fed deleted docs.

          Joel Bernstein added a comment -

          One candidate I can think of right off is the HashQParserPlugin, which handles shuffling for the streaming API. I'll review that code. The schema browser bug reported in this issue is also important to track down.

          Yonik Seeley added a comment -

          After that you can see that in "Schema Browser", when you select your field and press "Load Term Info", counts for your field are incorrect (they also take into account old versions of the overwritten documents).

          Either behavior could be correct, so we should first verify that this behavior has changed (i.e. did 5.2 take deleted documents into account?)
          Note that for full-text scoring, term statistics like ttf, idf, etc. do not take deletions into account.

          ASF subversion and git services added a comment -

          Commit 7b42653a274962f50661f4bed52dad298f7064d5 in lucene-solr's branch refs/heads/branch_5_4 from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7b42653 ]

          SOLR-8496: multi-select faceting and getDocSet(List<Query>) can match deleted docs

          git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_4@1725010 13f79535-47bb-0310-9956-ffa450edef68


            People

            • Assignee: Yonik Seeley
            • Reporter: Andreas Müller
            • Votes: 0
            • Watchers: 9
