Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5
    • Fix Version/s: 4.9, 5.0
    • Component/s: search
    • Labels:
      None

      Description

      Distributed IDF is a valuable enhancement for distributed search across non-uniform shards. This issue tracks the proposed implementation of an API to support this functionality in Solr.

      1. distrib.patch
        34 kB
        Andrzej Bialecki
      2. distrib-2.patch
        106 kB
        Andrzej Bialecki
      3. SOLR-1632.patch
        93 kB
        Mark Miller
      4. SOLR-1632.patch
        79 kB
        Andrzej Bialecki
      5. 3x_SOLR-1632_doesntwork.patch
        77 kB
        Shawn Heisey
      6. SOLR-1632.patch
        80 kB
        Andrzej Bialecki
      7. SOLR-1632.patch
        77 kB
        Markus Jelsma
      8. SOLR-1632.patch
        78 kB
        Markus Jelsma
      9. SOLR-1632.patch
        79 kB
        Markus Jelsma
      10. SOLR-1632.patch
        84 kB
        Mark Miller
      11. SOLR-1632.patch
        83 kB
        Mark Miller
      12. SOLR-1632.patch
        87 kB
        Mark Miller
      13. SOLR-1632.patch
        80 kB
        Mark Miller
      14. SOLR-1632.patch
        82 kB
        Mark Miller
      15. SOLR-1632.patch
        86 kB
        Vitaliy Zhovtyuk
      16. SOLR-5488.patch
        86 kB
        Vitaliy Zhovtyuk

        Issue Links

          Activity

          Andrzej Bialecki added a comment -

          Initial implementation. This supports the current global IDF (i.e. none), and an exact version of global IDF that requires one additional request per query to obtain per-shard stats.

          The design should already be flexible enough to implement LRU caching of docFreqs, and ultimately other methods of global IDF calculation (e.g. based on estimation or re-ranking).
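
          A rough sketch of the kind of pluggable source being described (the interface and names below are illustrative, not the classes from the patch):

          // Hypothetical sketch: a pluggable source of global term statistics that a
          // distributed searcher consults instead of its local reader. An "exact"
          // implementation would issue one extra per-shard request per query; a caching
          // implementation could answer from a map of previously seen docFreqs.
          import java.io.IOException;
          import java.util.Map;

          public interface GlobalDfSource {
            /** Returns global docFreq values keyed by "field:term". */
            Map<String, Long> globalDocFreq(Iterable<String> fieldTermKeys) throws IOException;
          }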

          Otis Gospodnetic added a comment -

          What about this approach: http://markmail.org/message/mjfmpzfspguepixx ?

          Andrzej Bialecki added a comment -

          I'm not sure what approach you are referring to. Following the terminology in that thread, this implementation follows the approach where there is a single merged big idf map at the master, and it's sent out to slaves on each query. However, when exactly this merging and sending happens is implementation-specific - in the ExactDFSource it happens on every query, but I hope the API can support other scenarios as well.

          Otis Gospodnetic added a comment -

          I didn't look at the patch, but from your comments it looks like you already have that "1 merged big idf map", which is really what I was aiming at, so that's good!

          I was just thinking that this map (file) would be periodically updated and pushed to the slaves, so that slaves can compute the global IDF locally instead of making any kind of extra requests.

          Andrzej Bialecki added a comment -

          I believe the API that I propose would support such implementation as well. Please note that it's usually not feasible to compute and distribute the complete IDF table for all terms - you would have to replicate a union of all term dictionaries across the cluster. In practice, you limit the amount of information by various means, e.g. only distributing data related to the current request (this implementation) or reducing the frequency of updates (e.g. LRU caching), or approximating global DF with a constant for frequent terms (where the contribution of their IDF to the score would be negligible anyway).
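
          As a purely illustrative sketch of the LRU-caching idea (not the patch's actual cache classes), an access-ordered LinkedHashMap already gives the bounded, least-recently-used behavior described above:

          import java.util.LinkedHashMap;
          import java.util.Map;

          public class BoundedDfCache extends LinkedHashMap<String, Long> {
            private final int maxEntries;

            public BoundedDfCache(int maxEntries) {
              super(16, 0.75f, true);          // access order => LRU eviction
              this.maxEntries = maxEntries;
            }

            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
              return size() > maxEntries;      // drop the least recently used docFreq entry
            }
          }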

          Marc Sturlese added a comment -

          Which value should the shard.purpose parameter take to enable or disable the exact version of global IDF?

          Andrzej Bialecki added a comment -

          Shard.purpose is set by a concrete implementation of the DFSource, so I guess your question is "how to turn ExactDFSource on/off"? If that's the case, then put this in your solrconfig.xml:

          <globalIDF class="org.apache.solr.search.ExactDFCache"/>
          
          Andrzej Bialecki added a comment -

          Updated patch, which also contains:

          • LRU-based cache that optimizes requests using cached values of docFreq for known terms
          • unit tests
          Yonik Seeley added a comment -

          I was looking into this a little offline with Mark, who noticed that some queries were not being rewritten and would thus throw an exception during weighting.

          It looks like the issue is this: rewrite() doesn't work for function queries (there is no propagation mechanism to go through value sources). This is a problem when real queries are embedded in function queries.

          Solr Function queries do have a mechanism to weight (via ValueSource.createWeight()).
          QueryValueSource does "Weight w = q.weight(searcher);" and that implementation of weight
          calls "Query query = searcher.rewrite(this);"

          This patch calls rewrite explicitly (which does nothing for embedded queries), and then when using the DFSource implementation of searcher, rewrite does nothing, and hence the embedded query is never rewritten and the subsequent createWeight() throws an exception.

          Yonik Seeley added a comment -

          Rewrite not working through function query is not the end of the problems either... there is also stuff like extractTerms.

          There is also the issue of Lucene changing rapidly... and the difficulty of adding new methods to ValueSource and making sure that all implementations correctly propagate them through to sub ValueSources. Perhaps one idea is to use a visitor pattern to decouple tree traversal from the operations being performed.
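
          As a bare-bones sketch of that visitor idea (hypothetical interfaces, nothing that exists in Lucene or this patch): traversal would live in accept(), and each operation - rewrite, extractTerms, stats collection - would become a visitor rather than a new method on every ValueSource:

          // Hypothetical: nodes only know how to forward a visitor to their children.
          interface QueryTreeVisitor {
            void visitTerm(String field, String text);     // leaf terms needing stats
            void visitEmbeddedQuery(Object query);         // e.g. a query inside a ValueSource
          }

          interface VisitableNode {
            void accept(QueryTreeVisitor visitor);
          }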

          LiLi added a comment -

          My Solr version is 1.4. I tried to apply the patch but it failed.
          For SolrCache<String, Integer> cache = perShardCache.get(shard); the compiler reports "The type SolrCache is not generic; it cannot be parameterized with arguments <String, Integer>".

          SolrCache is an interface: public interface SolrCache extends SolrInfoMBean

          patching file src/common/org/apache/solr/common/params/ShardParams.java
          patching file src/java/org/apache/solr/core/SolrConfig.java
          Hunk #1 succeeded at 30 with fuzz 2 (offset 2 lines).
          Hunk #2 FAILED at 197.
          1 out of 2 hunks FAILED -- saving rejects to file src/java/org/apache/solr/core/SolrConfig.java.rej
          patching file src/java/org/apache/solr/core/SolrCore.java
          Hunk #5 succeeded at 821 (offset 3 lines).
          patching file src/java/org/apache/solr/handler/component/QueryComponent.java
          Hunk #1 succeeded at 40 with fuzz 2 (offset -2 lines).
          Hunk #6 succeeded at 302 (offset 13 lines).
          Hunk #7 succeeded at 324 with fuzz 2 (offset 12 lines).
          Hunk #8 succeeded at 343 (offset 21 lines).
          Hunk #9 succeeded at 367 (offset 21 lines).
          Hunk #10 succeeded at 423 (offset 28 lines).
          patching file src/java/org/apache/solr/handler/component/SearchHandler.java
          patching file src/java/org/apache/solr/handler/component/ShardRequest.java
          Hunk #1 FAILED at 37.
          1 out of 1 hunk FAILED -- saving rejects to file src/java/org/apache/solr/handler/component/ShardRequest.java.rej
          patching file src/java/org/apache/solr/search/DFCache.java
          patching file src/java/org/apache/solr/search/DFSource.java
          patching file src/java/org/apache/solr/search/DefaultDFCache.java
          patching file src/java/org/apache/solr/search/ExactDFCache.java
          patching file src/java/org/apache/solr/search/LRUDFCache.java
          patching file src/java/org/apache/solr/search/SolrIndexSearcher.java
          Hunk #1 succeeded at 77 (offset 3 lines).
          Hunk #2 succeeded at 149 (offset 3 lines).
          Hunk #3 succeeded at 699 (offset 46 lines).
          Hunk #4 succeeded at 927 (offset 59 lines).
          Hunk #5 succeeded at 1041 (offset 59 lines).
          Hunk #6 succeeded at 1190 with fuzz 1 (offset 180 lines).
          Hunk #7 FAILED at 1276.
          Hunk #8 FAILED at 1311.
          Hunk #9 succeeded at 1608 (offset 104 lines).
          Hunk #10 succeeded at 1716 (offset 113 lines).
          Hunk #11 succeeded at 1774 (offset 113 lines).
          2 out of 11 hunks FAILED -- saving rejects to file src/java/org/apache/solr/search/SolrIndexSearcher.java.rej
          patching file src/java/org/apache/solr/util/SolrPluginUtils.java
          can't find file to patch at input line 1206
          Perhaps you used the wrong -p or --strip option?
          The text leading up to this was:
          --------------------------

          Index: trunk/src/test/org/apache/solr/BaseDistributedSearchTestCase.java
          ===============================================================
          --- trunk/src/test/org/apache/solr/BaseDistributedSearchTestCase.java (revision 893413)
          +++ trunk/src/test/org/apache/solr/BaseDistributedSearchTestCase.java (working copy)
          --------------------------
          File to patch:
          Skip this patch? [y] n
          File to patch:
          Skip this patch? [y]
          Skipping patch.
          4 out of 4 hunks ignored
          patching file src/test/org/apache/solr/search/TestDefaultDFCache.java
          patching file src/test/org/apache/solr/search/TestExactDFCache.java
          patching file src/test/org/apache/solr/search/TestLRUDFCache.java
          patching file src/test/test-files/solr/conf/solrconfig-defaultdfcache.xml
          patching file src/test/test-files/solr/conf/solrconfig-exactdfcache.xml
          patching file src/test/test-files/solr/conf/solrconfig-lrudfcache.xml
          Thorsten Scherler added a comment -

          Regarding the comment "Perhaps one idea is to use a visitor pattern to decouple tree traversal from the operations being performed" - can you please explain where to implement the listener/visitor? I had a quick look at the patch and it seems to me that the main functionality is in trunk/src/java/org/apache/solr/search/SolrIndexSearcher.java and the rest is mostly caching concerns, right?

          Mark Miller added a comment -

          Recently I updated this patch to trunk and got rid of the threadlocal usage and Query rewriting that was the reason we had to pull this from trunk long ago - then I attempted to override stats on IndexSearcher with global stats - this is when I realized that had no effect on scoring anymore - this will now be addressed in LUCENE-3555. Unfortunately, I didn't pay attention and lost that code. It's unfortunate, because it would have been a nice head start on this issue - I think we may want to make other changes/improvements, but it would have been a start with something working. It was half a pain to do since the patch has to be applied manually, but perhaps doing it a second time will be faster...

          Mark Miller added a comment -

          Correction: I got rid of the rewrite that was added for the multi-searcher type behavior - I hadn't solved the issue of rewriting to get the terms to retrieve stats for - that patch was not yet going to work with multi-term queries.

          Mark Miller added a comment -

          Although, actually I'm not even sure if that rewrite is really a problem - I almost don't think it will tickle the same issue as the rewrite that was happening before the search. I didn't have a chance to test it or look into it in depth or anything yet though.

          Mark Miller added a comment -

          I found this work hidden away in my Eclipse workspace! It still has the thread-local stuff - either I had only thought about what I was going to do to remove it, or this was not the latest work - but either way, it starts us from a patch that applies to trunk, which is much better. There is still a fair amount to do, at minimum to switch to using the new scoring stats. I started some really simple moves towards this (super baby step), so things don't compile at the moment. The patch should be clean though.

          Andrzej Bialecki added a comment -

          Patch updated to trunk (rev. 1232110). I refactored the code and changed the names of new classes to better reflect the fact that we work with complex stats and not primitive freqs. Included unit tests are passing.

          Shawn Heisey added a comment -

          Is this something that can be added to branch_3x? With high fuzz and ignore whitespace, the patch applies, but then fails to compile. It also fails to compile when I set fuzz to zero, pay attention to whitespace, and manually fix the patch rejects. I couldn't figure out how to fix the problems.

          Andrzej Bialecki added a comment -

          Is this something that can be added to branch_3x?

          Not without porting - the Lucene/Solr APIs have changed significantly, and this patch uses some low-level APIs that are different between trunk and 3x.

          Yonik Seeley added a comment -

          Haven't had time to look this over that closely, but this did jump out at me:

          +public class CollectionStats {
          + public String field;
          + public int maxDoc;
          + public int docCount;

          Shouldn't we be using longs here so we can support more than 2B docs?

          Andrzej Bialecki added a comment -

          Yeah, I was curious about this too. However, this is how CollectionStatistics is defined in Lucene, so it's something that we have to change in Lucene too.

          Shawn Heisey added a comment -

          This is a diff from my best approximation of applying the trunk patch to 3x. It doesn't compile, but it will probably save someone some time.

          Robert Muir added a comment -

          However, this is how CollectionStatistics is defined in Lucene, so it's something that we have to change in Lucene too.

          TermStatistics too. Let's open a separate issue for this.

          Andrzej Bialecki added a comment -

          Patch updated to use long types, and properly handle -1's in freqs.

          Robert Muir added a comment -

          Thanks Andrzej: I think it will be nice that all of Lucene's scoring algorithms can work in distributed mode.

          Just one question about the patch: in StatsUtil I can't tell whether termFromString matches termToString. termToString seems to base64-encode the term text (a good idea, since terms can be binary), but I don't see the corresponding decode in termFromString (there is an XXX: comment though).
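
          For illustration only (plain JDK Base64, not the patch's StatsUtil), a pair that actually round-trips would look like this; the point is that whatever termToString emits, termFromString must decode the same way:

          import java.util.Base64;

          final class TermCodec {
            // field name and base64-encoded term bytes, separated by ':'
            static String termToString(String field, byte[] termBytes) {
              return field + ":" + Base64.getEncoder().encodeToString(termBytes);
            }

            static byte[] termBytesFromString(String encoded) {
              int sep = encoded.indexOf(':');
              return Base64.getDecoder().decode(encoded.substring(sep + 1));
            }
          }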

          Andrzej Bialecki added a comment -

          Hmm, indeed... I must have switched to toString() for debugging (it's easier to eyeball an ASCII string than a base64 string). This should use base64 throughout. I'll prepare a patch shortly.

          (BTW, I'm aware that passing around blobs of base64 inside SolrParams is ugly. I'm open to suggestions how to handle this better).

          Yonik Seeley added a comment -

          (BTW, I'm aware that passing around blobs of base64 inside SolrParams is ugly. I'm open to suggestions how to handle this better).

          I'd prefer non-base64 at the Solr transport level (e.g. termStats=how,now,brown,cow). It will be both smaller and make it much easier to debug other things.

          Although Lucene can technically index arbitrary binary now, Solr does not use that anywhere (and won't for 4.0). It would take a good amount of infrastructure work all over to truly allow that. If/when we allow arbitrary binary terms, it should be relatively easy to extend the syntax we pick today to allow selectively base64 encoded terms.

          There are already a number of places in Solr where we use StrUtils.join (a comma-separated list of strings) to specify a list of terms (both in distrib faceting and distrib search, for example).

          Robert Muir added a comment -

          Although Lucene can technically index arbitrary binary now, Solr does not use that anywhere (and won't for 4.0).

          That's not actually true. Collation already uses it.

          Yonik Seeley added a comment -

          That's not actually true. Collation already uses it.

          Hmmm, that's normally just for sorting though. I wonder if that works with distributed search today?

          Anyway, we have a schema - that can allow us to do what makes sense depending on the field (i.e. only use base64 or \x?? for fields where there will be non-character terms)

          Robert Muir added a comment -

          It's also used for locale-sensitive range queries (and of course TermQuery etc. works too, but that's not interesting).

          Andrzej Bialecki added a comment -

          \x or %xx escaping could be ok, I guess - it's safe, and in most cases it's still readable, unlike base64.

          Yonik Seeley added a comment -

          It's also used for locale-sensitive range queries

          Given that range queries (and other multi-term queries) are constant scoring and may contain many terms, hopefully we avoid requesting term stats for these?

          Andrzej Bialecki added a comment -

          hopefully we avoid requesting term stats for these?

          There is no provision for this yet in the current patch.

          Robert Muir added a comment -

          There is no provision for this yet in the current patch.

          There is nothing different about an MTQ-generated BQ compared to a huge BQ that a Solr user submits. In my opinion, instead of saying "screw scoring certain types of queries", this stuff should be done by inexact implementations (and maybe that should be the default, fine). E.g. a nice heuristic could look at the local stats and say: sure, there are 100 terms, but 50 are low-freq, so let's assume an additive constant C for those, batch the other terms into e.g. 5 ranges, and only request stats on 5 "surrogate" terms representative of those groups.

          Just make sure any heuristic is always added to what is surely present locally, e.g. distributed docFreq is always >= local docFreq. Then no scoring algorithms will break.
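
          That invariant can be written down in one line (a sketch, not code from the patch): the remote estimate is added on top of the local count, so the global value never drops below what the local reader reports:

          final class StatsApproximation {
            // Additive estimate, clamped so that global docFreq >= local docFreq always holds.
            static long approximateGlobalDocFreq(long localDocFreq, long estimatedRemoteDocFreq) {
              return localDocFreq + Math.max(0L, estimatedRemoteDocFreq);
            }
          }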

          Yonik Seeley added a comment -

          There is nothing different from a MTQ generated BQ than a huge BQ a solr user submits.

          Multi-term queries like range query, prefix query, etc, do not depend on term stats, and can consist of millions of terms. It's a waste to attempt to return term stats for them (estimated or not).

          It would also be a shame to use estimates rather than exact numbers for what will be the common case (i.e. when there's really only a couple of terms you need stats for):
          +title:"blue whale" +title_whole:[a TO g}
          or
          +title:"blue whale" +date:[2001-01-01 TO 2010-01-01}

          Ideally, we wouldn't even do a rewrite in order to collect terms - rewrite itself has gotten much more expensive in some circumstances (i.e. iterating the first 350 terms to determine what style of rewrite should be used)

          Robert Muir added a comment -

          Multi-term queries like range query, prefix query, etc, do not depend on term stats, and can consist of millions of terms.

          No, they cannot.

          It can't be millions of terms, because a million exceeds the boolean max clause count, in which case it will always use a filter.

          Ideally, we wouldn't even do a rewrite in order to collect terms

          You don't have to, Lucene's test case (ShardSearchingTestBase) doesn't do an extra rewrite to collect terms.

          @Override
          public Query rewrite(Query original) throws IOException {
            final Query rewritten = super.rewrite(original);
            final Set<Term> terms = new HashSet<Term>();
            rewritten.extractTerms(terms);
          
            // Make a single request to remote nodes for term
            // stats:
            ...
            return rewritten;
          }
          
          • rewrite itself has gotten much more expensive in some circumstances (i.e. iterating the first 350 terms to determine what style of rewrite should be used)

          Got any benchmarks to back this up with?

          It's incorrect to say rewrite has gotten more expensive - more expensive than what?
          It's the opposite: it's actually much faster when rewriting to boolean queries in 4.0 because it always works per-segment.

          Yonik Seeley added a comment -

          it can't be millions of terms because a million exceeds the boolean max clause count, in which it will always use a filter.

          So depending on exactly how many terms the range query covers, extractTerms may or may not return any.
          So extractTerms() may return 300 terms the first time, and then after someone adds some docs to the index it may suddenly return 0.
          This just strengthens the case that we should be consistent and just always ignore the terms from these MTQs.

          Its incorrect to say rewrite has gotten more expensive? More expensive than what?

          Sorry, I wasn't specific enough. I meant compared to back when Solr had its own RangeFilter and PrefixFilter that it would wrap in a ConstantScoreQuery. There never was any rewrite-to-boolean-query or consulting of the index, so it's obviously a faster rewrite().

          But back to the original question - I still see no reason to request/return/cache terms/stats from these multi-term queries when by definition they should not change the results of the request.

          Uwe Schindler added a comment -

          Sorry, I wasn't specific enough. I meant compared to back when Solr had its own RangeFilter and PrefixFilter that it would wrap in a ConstantScoreQuery. There never was any rewrite-to-boolean-query or consulting of the index, so it's obviously a faster rewrite().

          Just set the MTQ rewrite mode in Solr to CONSTANT_SCORE_FILTER_REWRITE - done. No discussion is needed and no custom RangeQuery in Solr.
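
          Sketched against the Lucene 3.x/4.x API of that era (CONSTANT_SCORE_FILTER_REWRITE was later removed), the suggestion amounts to something like:

          import org.apache.lucene.search.MultiTermQuery;
          import org.apache.lucene.search.TermRangeQuery;

          class ConstantScoreRangeExample {
            static TermRangeQuery newRange() {
              // Rewrite to a constant-score filter: no scored term clauses, so no
              // per-term stats are ever needed for this query.
              TermRangeQuery range =
                  TermRangeQuery.newStringRange("title_whole", "a", "g", true, false);
              range.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);
              return range;
            }
          }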

          Yonik Seeley added a comment -

          Just set in Solr the rewrite mode of MTQ to CONSTANT_SCORE_FILTER_REWRITE - done.

          Right - I was considering the best way to do this (passing that info around Solr about when to use which method).
          It solves both issues - relatively expensive rewrites that are not needed, and ignoring the MTQ terms.

          Robert Muir added a comment -

          But back to the original question - I still see no reason to request/return/cache terms/stats from these multi-term queries when by definition they should not change the results of the request.

          My original point (forgetting about the specifics of MTQ, how things are being scored, or anything else) is still that it's a general case of a Query that can have lots of Terms.

          So if there are concerns about "lots of terms", I still think it's worth considering some limit on how many Terms would be exchanged. Maybe BooleanQuery's max clause count is already good enough, but another way to do it would be to have an implementation that switches to approximation when the term count for a query gets too high.

          Markus Jelsma added a comment -

          Any progress to report or does anyone have a patch that is updated for trunk?

          Markus Jelsma added a comment -

          Updated patch to build for rev: 1447516 (Mon, 18 Feb 2013)

          All tests seem to pass.

          Mark Miller added a comment -

          Nice. I mentioned this to AB not too long ago, but I'm of the mind to simply commit this. It will default to off, and we can continue to work on it.

          So unless someone steps in, I'll commit what Markus has put up.

          Markus, have you tried this out at all beyond the unit tests - e.g. on a cluster?

          Markus Jelsma added a comment -

          No, not yet. Please let me do some real tests first; there must be issues - the patch is over a year old!

          Markus Jelsma added a comment -

          It doesn't really seem to work; we're seeing lots of NPEs, and if a response does come through, IDF is not consistent for all terms. Most requests return one of the NPEs below. Sometimes it works, and then the second request just fails.

          java.lang.NullPointerException
          	at org.apache.solr.search.stats.ExactStatsCache.sendGlobalStats(LRUStatsCache.java:202)
          	at org.apache.solr.handler.component.QueryComponent.createMainQuery(QueryComponent.java:783)
          	at org.apache.solr.handler.component.QueryComponent.regularDistributedProcess(QueryComponent.java:618)
          	at...
          
          java.lang.NullPointerException
          	at org.apache.solr.search.stats.LRUStatsCache.sendGlobalStats(LRUStatsCache.java:228)
          	at org.apache.solr.handler.component.QueryComponent.createMainQuery(QueryComponent.java:783)
          	at org.apache.solr.handler.component.QueryComponent.regularDistributedProcess(QueryComponent.java:618)
          	at...
          

          We also see this one from time to time; it looks like it is thrown if there are `no servers hosting shard`:

          java.lang.NullPointerException
          	at org.apache.solr.search.stats.LRUStatsCache.mergeToGlobalStats(LRUStatsCache.java:112)
          	at org.apache.solr.handler.component.QueryComponent.updateStats(QueryComponent.java:743)
          	at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:659)
          	at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:634)
          	at ..
          

          It also imposes a huge performance penalty with both LRUStatsCache and ExactStatsCache: if you're used to 40 ms response times, you'll see the average jump to 2 seconds with very frequent 5-second spikes. Performance stays poor even if logging is disabled.

          The logs are also swamped with lines like:

          2013-02-20 11:54:48,091 WARN [search.stats.LRUStatsCache] - [http-8080-exec-5] - : ## Missing global colStats info: <FIELD>, using local
          2013-02-20 11:54:48,091 WARN [search.stats.LRUStatsCache] - [http-8080-exec-5] - : ## Missing global termStats info: <FIELD>:<TERM>, using local
          

          Both StatsCache implementations behave like this; each query logs lines like the above. Maybe performance is poor because it tries to look up the terms every time, but I'm not sure yet.

          Finally, something crazy I'd like to share:

          -Infinity = (MATCH) sum of:
            -Infinity = (MATCH) max plus 0.35 times others of:
              -Infinity = (MATCH) weight(content_nl:amsterdam^1.6 in 449) [], result of:
                -Infinity = score(doc=449,freq=1.0 = termFreq=1.0
          ), product of:
                  1.6 = boost
                  -Infinity = idf(docFreq=29800090, docCount=-1)
                  1.0 = tfNorm, computed from:
                    1.0 = termFreq=1.0
                    1.2 = parameter k1
                    0.0 = parameter b (norms omitted for field)
          

          If someone happens to recognize the issues above, I'm all ears.

          Mark Miller added a comment -

          Hmm, that makes it look like the current tests for this must be pretty weak then.

          Markus Jelsma added a comment -

          Things have changed a lot in the past 13 months and I haven't figured it all out yet. I'll try to make sense of it, but some expert opinion on the patch, and someone trying it out, would be more than helpful. Is Andrzej not around?

          Markus Jelsma added a comment -

          Updated patch for trunk:
          Last Changed Rev: 1488431
          Last Changed Date: 2013-06-01 01:42:51 +0200 (Sat, 01 Jun 2013)

          David Boychuck added a comment -

          Is this patch currently working in 5.0?

          Markus Jelsma added a comment -

          No, it does not work at all. I did spend some time on it but had other things to do. In the end I removed my (non-working) changes and uploaded a patch that at least compiles against the revision from that time.

          Markus Jelsma added a comment -

          OK, I updated the patch for today's trunk and it actually works now with ExactStatsCache. We now have correct DF for distributed queries.

          I removed the perReaderTermContext in ExactStatsCache, which cached the TermContext for new terms. This was a problem because caching it this way meant that any second term got the same DF as the first.

          I also added a local boolean to SolrIndexSearcher's collectionStatistics() and termStatistics() to force them to return only local stats. This is a nasty hack to prevent them from returning the other shards' DF. Without it, DF increases with every other request; in the end it would crash the system because the number gets too high.

          Also, the warning ## Missing global termStats info: " + term + ", using local should perhaps not be a warning at all. It is also emitted for fields that don't have those terms. The check in returnLocalStats doesn't add terms for docFreq == 0.

          Add <globalStats class="org.apache.solr.search.stats.ExactStatsCache"/> to your solrconfig in the config section to make it work.

          Please check my patch and let's fix this issue so we hopefully can get distributed IDF in Solr 4.7.

          Mark Miller added a comment -

          I'm looking at a couple of the test fails before I go to bed tonight:

          [junit4] Tests with failures:
          [junit4] - org.apache.solr.handler.component.QueryElevationComponentTest.testGroupedQuery
          [junit4] - org.apache.solr.TestDistributedSearch.testDistribSearch
          [junit4] - org.apache.solr.search.stats.TestLRUStatsCache.testDistribSearch
          [junit4] - org.apache.solr.TestGroupingSearch.testGroupingGroupSortingScore_basicWithGroupSortEqualToSort
          [junit4] - org.apache.solr.TestGroupingSearch.testGroupingGroupSortingScore_withTotalGroupCount
          [junit4] - org.apache.solr.TestGroupingSearch.testGroupingGroupSortingScore_basic
          [junit4] - org.apache.solr.search.stats.TestExactStatsCache.testDistribSearch
          [junit4] - org.apache.solr.update.AddBlockUpdateTest.testXML
          [junit4] - org.apache.solr.update.AddBlockUpdateTest.testSolrJXML

          Mark Miller added a comment -

          I did not do a thorough review or anything, but here is a patch...

          • I cleaned up a lot of things.
          • I fixed the things that needed to be fixed for the tests to pass.
          • I got precommit passing (though I may have added a nocommit back in afterward).

          Anyway, tests seem to pass for me.

          Mark Miller added a comment -

          More cleanup in this patch.

          Mark Miller added a comment -

          The config you need to use to turn this on is now:

          <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>

It needs to go in the top-level config section.
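
For orientation, here is a minimal sketch of where that element would sit in solrconfig.xml. Everything other than the statsCache line is placeholder config, not something taken from the patch:

<?xml version="1.0" encoding="UTF-8"?>
<config>
  <!-- ... luceneMatchVersion, libs, and other handlers elided ... -->

  <!-- top-level element that enables distributed IDF via a StatsCache implementation -->
  <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>

  <requestHandler name="/select" class="solr.SearchHandler"/>
</config>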

          Mark Miller added a comment -

          The thread local still scares me ... need to look closer at that.

          Mark Miller added a comment -

I've got two main concerns: the thread local, and the fact that the stats cache does not look thread safe even though it is shared across threads.

The thread local is concerning because you can have thousands of threads, and each one will cache how many stats? I wish we could do something better.

          Mark Miller added a comment -

This patch removes the new thread local by piggybacking on the existing thread local Solr uses for a request (which is already cleaned up nicely per request).

I also attempted to make ExactStatsCache thread safe, but I think the whole design there needs a review.

          LRUStatsCache is certainly still not thread safe and needs to be fixed.
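
To illustrate the piggybacking idea, here is a rough sketch, not code from the patch: SolrRequestInfo and the request context map are existing Solr APIs, while the helper class and the context key are made up for illustration.

import java.util.Map;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrRequestInfo;

// Illustrative only: stash per-request stats in the request context instead of a
// dedicated ThreadLocal, so cleanup happens when the request itself is closed.
public final class RequestScopedStats {
  private static final String KEY = "distrib.stats"; // hypothetical context key

  private RequestScopedStats() {}

  public static void put(Object stats) {
    SolrRequestInfo info = SolrRequestInfo.getRequestInfo();
    if (info != null) {
      Map<Object, Object> ctx = info.getReq().getContext();
      ctx.put(KEY, stats);
    }
  }

  public static Object get() {
    SolrRequestInfo info = SolrRequestInfo.getRequestInfo();
    return info == null ? null : info.getReq().getContext().get(KEY);
  }
}

The point is simply that the request context lives exactly as long as the request, so there is nothing left behind on the worker thread afterwards.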

          Mark Miller added a comment -

          Markus Jelsma, how was performance with your most recent patch compared to what you first reported?

          Mark Miller added a comment -

          Whoops - attached the wrong patch this morning. Anyway, here is a new one.

          • Attempted to make LRUStatsCache thread safe.
          • Lots of little cleanups and small improvements
          Markus Jelsma added a comment -

It is much faster now, even usable. But I haven't tried it in a larger cluster yet.

          Mark Miller added a comment -

          Last patch was doubled - pasted twice I guess. Here is a clean one.

          Mark Miller added a comment -

Here is the latest work I was playing around with the other night: a lot more cleanup, a bunch of duplicate code removed, etc.

          I think things are fairly reasonable now given the current design.

          I do think we want to look at the design to make sure it's going to work well with the caching impls that we will want to add.
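
As a point of reference for that review: an LRU bound on cached term stats could be as simple as an access-ordered LinkedHashMap. This standalone sketch is illustrative only and is not how LRUStatsCache is implemented in any of the attached patches:

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal bounded LRU map: access-ordered, evicts the least recently used entry
// once maxEntries is exceeded. Types and sizing are placeholders.
public class LruStatsMap<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  public LruStatsMap(int maxEntries) {
    super(16, 0.75f, true); // accessOrder=true makes the eldest entry the LRU one
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;
  }
}

Note that this alone is not thread safe; it would still need external synchronization or a concurrent wrapper, which is exactly the concern raised above.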

          Vitaliy Zhovtyuk added a comment -

          Updated to latest trunk.
Cleaned up code duplicates. Fixed org.apache.solr.search.stats.TestLRUStatsCache and added a test for org.apache.solr.search.stats.ExactSharedStatsCache.
          Fixed javadocs.

          Markus Jelsma added a comment -

Hi Vitaliy, are you sure it still works? I tried your patch and a few older ones again, but docCounts are no longer the sum across the cluster. The GET_STATS query is executed, though.

Two-node test cluster:

           384841 [qtp1175813699-17] INFO  org.apache.solr.core.SolrCore  – [collection1] webapp=/solr path=/select params={distrib=false&debug=track&wt=javabin&requestPurpose=GET_TERM_STATS&version=2&rows=10&debugQuery=false&shard.url=http://127.0.1.1:8983/solr/collection1/&NOW=1394444039677&rid=-collection1-1394444039677-12&shards.purpose=2&q=wiki&isShard=true} status=0 QTime=1 
          384848 [qtp1175813699-17] INFO  org.apache.solr.core.SolrCore  – [collection1] webapp=/solr path=/select params={distrib=false&debug=track&wt=javabin&requestPurpose=GET_TOP_IDS,GET_STATS,GET_TERMS,GET_MLT_RESULTS,SET_TERM_STATS&version=2&rows=10&org.apache.solr.stats.colStats=content_nl,121630,115956,16436279,11372267&org.apache.solr.stats.terms=content_nl:wiki&NOW=1394444039677&shard.url=http://127.0.1.1:8983/solr/collection1/&debugQuery=false&fl=id,score&shards.purpose=5636&rid=-collection1-1394444039677-12&start=0&q=wiki&org.apache.solr.stats.termStats=content_nl:wiki,284,645&isShard=true&fsv=true} hits=138 status=0 QTime=1 
          384863 [qtp1175813699-17] INFO  org.apache.solr.core.SolrCore  – [collection1] webapp=/solr path=/select params={ids=http://nl.wikipedia.org/wiki/Overleg_sjabloon:Infobox_film,http://nl.wikipedia.org/wiki/Overleg_sjabloon:Navigatie_Bijbel,http://nl.wikipedia.org/wiki/Overleg_help:Gebruik_van_sjablonen,http://nl.wikipedia.org/wiki/Overleg_sjabloon:Citeer_boek,http://nl.wikipedia.org/wiki/Overleg_sjabloon:Wikt&distrib=false&debug=track&wt=javabin&requestPurpose=GET_FIELDS,GET_DEBUG&version=2&rows=10&debugQuery=true&shard.url=http://127.0.1.1:8983/solr/collection1/&NOW=1394444039677&rid=-collection1-1394444039677-12&shards.purpose=320&q=wiki&isShard=true} status=0 QTime=7 
          384870 [qtp1175813699-13] INFO  org.apache.solr.core.SolrCore  – [collection1] webapp=/solr path=/select params={debugQuery=true&q=wiki} rid=-collection1-1394444039677-12 hits=284 status=0 QTime=33 
          
          
          380242 [qtp1175813699-16] INFO  org.apache.solr.core.SolrCore  – [collection1] webapp=/solr path=/select params={distrib=false&debug=track&wt=javabin&requestPurpose=GET_TERM_STATS&version=2&rows=10&debugQuery=false&shard.url=http://127.0.1.1:7574/solr/collection1/&NOW=1394444039677&rid=-collection1-1394444039677-12&shards.purpose=2&q=wiki&isShard=true} status=0 QTime=0 
          380249 [qtp1175813699-16] INFO  org.apache.solr.core.SolrCore  – [collection1] webapp=/solr path=/select params={distrib=false&debug=track&wt=javabin&requestPurpose=GET_TOP_IDS,GET_STATS,GET_TERMS,GET_MLT_RESULTS,SET_TERM_STATS&version=2&rows=10&org.apache.solr.stats.colStats=content_nl,121630,115956,16436279,11372267&org.apache.solr.stats.terms=content_nl:wiki&NOW=1394444039677&shard.url=http://127.0.1.1:7574/solr/collection1/&debugQuery=false&fl=id,score&shards.purpose=5636&rid=-collection1-1394444039677-12&start=0&q=wiki&org.apache.solr.stats.termStats=content_nl:wiki,284,645&isShard=true&fsv=true} hits=146 status=0 QTime=2 
          380263 [qtp1175813699-16] INFO  org.apache.solr.core.SolrCore  – [collection1] webapp=/solr path=/select params={ids=http://nl.wikipedia.org/wiki/Overleg_sjabloon:Navigatie,http://nl.wikipedia.org/wiki/Overleg_help:Waarom_staat_mijn_bestand_op_de_beoordelingslijst,http://nl.wikipedia.org/wiki/Overleg_help:Wikipediachat,http://nl.wikipedia.org/wiki/Overleg_sjabloon:Coördinaten,http://nl.wikipedia.org/wiki/Overleg_sjabloon:Sjabdoc/doc&distrib=false&debug=track&wt=javabin&requestPurpose=GET_FIELDS,GET_DEBUG&version=2&rows=10&debugQuery=true&shard.url=http://127.0.1.1:7574/solr/collection1/&NOW=1394444039677&rid=-collection1-1394444039677-12&shards.purpose=320&q=wiki&isShard=true} status=0 QTime=6 
          

But I get these scores:

          
          12.8123455 = (MATCH) weight(content_nl:wiki in 18636) [], result of:
            12.8123455 = score(doc=18636,freq=33.0 = termFreq=33.0
          ), product of:
              6.0355678 = idf(docFreq=138, docCount=57897)
              2.122807 = tfNorm, computed from:
                33.0 = termFreq=33.0
                1.2 = parameter k1
                0.0 = parameter b (norms omitted for field)
          
          
          12.558066 = (MATCH) weight(content_nl:wiki in 60634) [], result of:
            12.558066 = score(doc=60634,freq=25.0 = termFreq=25.0
          ), product of:
              5.982207 = idf(docFreq=146, docCount=58059)
              2.0992365 = tfNorm, computed from:
                25.0 = termFreq=25.0
                1.2 = parameter k1
                0.0 = parameter b (norms omitted for field)
          

          Did it work for you?

          Mark Miller added a comment -

I tried your patch and a few older ones again, but docCounts are no longer the sum across the cluster.

          Do you see what is missing in the tests to catch this?

          Markus Jelsma added a comment -

No, but I think this happened when the QueryCommand code

              public StatsSource getStatsSource() { return statsSource; }
              public QueryCommand setStatsSource(StatsSource dfSource) {
                this.statsSource = dfSource;
                return this;
              }
          

          got removed.
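
To make the role of that hook concrete, here is a rough sketch of how a StatsSource could be fed to the local search via that setter. The wiring shown is illustrative, not taken from any of the attached patches, and the StatsSource instance is assumed to come from whatever stats cache the request is using:

import java.io.IOException;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.search.SolrIndexSearcher.QueryCommand;
import org.apache.solr.search.SolrIndexSearcher.QueryResult;
import org.apache.solr.search.stats.StatsSource;

// Illustrative wiring only: hand the local search a StatsSource holding merged
// global stats, so term weights are computed from global rather than per-shard DF.
class GlobalStatsSearchSketch {
  QueryResult searchWithGlobalStats(ResponseBuilder rb, SolrIndexSearcher searcher,
                                    StatsSource globalStats) throws IOException {
    QueryCommand cmd = rb.getQueryCommand();
    cmd.setStatsSource(globalStats); // the hook quoted above
    QueryResult result = new QueryResult();
    searcher.search(result, cmd);
    return result;
  }
}

Without that setter, there is no place to attach the merged stats to the local query, which would explain falling back to per-shard docCounts.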

          Vitaliy Zhovtyuk added a comment -
          • Fixed global stats distribution
          • Added an assert on the query explain (docNum, weight, and idf should be the same in distributed tests); this assert is valid only on the second query, since global stats are merged at the end of the first query.
          Uwe Schindler added a comment -

          Move issue to Solr 4.9.


  People

    • Assignee: Mark Miller
    • Reporter: Andrzej Bialecki
    • Votes: 28
    • Watchers: 46
