Solr / SOLR-711

SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: search
    • Labels: None

      Description

      From http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html:

      Scenario:

      • 10,000,000 documents in the index;
      • 5-10 terms per document;
      • 200,000 unique terms for a tokenized field.

      Calculating the sizes of 200,000 intersections against the FilterCache is roughly 100 times slower than traversing the 10 - 20,000 documents of a smaller DocSet and counting the frequencies of their terms.

      This approach is not applicable if the size of the DocSet is close to the total number of unique tokens (200,000 in our scenario).
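
The trade-off described above can be illustrated with a minimal, self-contained sketch. It does not use Solr or Lucene classes: the class `FacetCountSketch`, its method names, and the in-memory maps standing in for the FilterCache and for stored term vectors are all hypothetical, chosen only for illustration. It also assumes each document lists a given term at most once, as a term vector would.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the SimpleFacets implementation.
public class FacetCountSketch {

    // Strategy A: one intersection per unique term.
    // Cost grows with the number of unique terms (200,000 in the scenario).
    public static Map<String, Integer> countByIntersection(
            Map<String, BitSet> termToDocs, BitSet docSet) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, BitSet> e : termToDocs.entrySet()) {
            BitSet hits = (BitSet) e.getValue().clone();
            hits.and(docSet);                      // docs with this term that are also in docSet
            int n = hits.cardinality();
            if (n > 0) {
                counts.put(e.getKey(), n);
            }
        }
        return counts;
    }

    // Strategy B: walk only the documents in docSet and tally their terms.
    // Cost grows with docSet size times terms per document.
    // Assumes each document's term list contains no duplicates.
    public static Map<String, Integer> countByTermVectors(
            Map<Integer, List<String>> docToTerms, BitSet docSet) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc = docSet.nextSetBit(0); doc >= 0; doc = docSet.nextSetBit(doc + 1)) {
            for (String term : docToTerms.getOrDefault(doc, List.of())) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy "term vectors": document id -> distinct terms in the field.
        Map<Integer, List<String>> docToTerms = Map.of(
                0, List.of("apache", "solr"),
                1, List.of("solr", "facet"),
                2, List.of("lucene"),
                3, List.of("facet", "lucene"));

        // Toy inverted index built from the same data: term -> documents.
        Map<String, BitSet> termToDocs = new TreeMap<>();
        docToTerms.forEach((doc, terms) -> terms.forEach(
                t -> termToDocs.computeIfAbsent(t, k -> new BitSet()).set(doc)));

        // A small result set containing documents 1 and 3.
        BitSet docSet = new BitSet();
        docSet.set(1);
        docSet.set(3);

        System.out.println(countByIntersection(termToDocs, docSet)); // {facet=2, lucene=1, solr=1}
        System.out.println(countByTermVectors(docToTerms, docSet));  // {facet=2, lucene=1, solr=1}
    }
}
```

Both strategies return the same counts; they differ only in what they iterate over. Strategy A touches every unique term regardless of how small the DocSet is, while strategy B touches only the documents in the DocSet, which is why it is expected to win when the DocSet is much smaller than the number of unique terms.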

      See SimpleFacets.java:

      public NamedList getFacetTermEnumCounts(
        SolrIndexSearcher searcher, 
        DocSet docs, ...
      

        Activity

        Toby Cole added a comment -

        We've seen this problem with our dataset: we have around 10m small records and were trying to facet on several multi-valued strings, two of which had over 40k unique values (around 10 values per record).
        If we can come up with a plan I don't mind volunteering to implement it.

        Shalin Shekhar Mangar added a comment -

        What would be a good criterion for switching between the current and the proposed strategy?

        Fuad, did you run any tests to find a magic ratio between unique tokens and DocSet size?

        Shalin Shekhar Mangar added a comment -

        With the new performance changes in faceting with SOLR-475, is this issue still relevant?

        Fuad Efendi added a comment -

        Thanks, Shalin, for pointing to SOLR-475, which is a very advanced solution to the term-counting approach.


          People

          • Assignee: Unassigned
          • Reporter: Fuad Efendi
          • Votes: 0
          • Watchers: 2

              Time Tracking

              • Estimated: 1,680h
              • Remaining: 1,680h
              • Logged: Not Specified