Solr - SOLR-711

SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: search
    • Labels:
      None

      Description

      From http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html:

      Scenario:

      • 10,000,000 documents in the index;
      • 5-10 terms per document;
      • 200,000 unique terms for a tokenized field.

      Obviously, calculating the sizes of 200,000 intersections with the FilterCache is 100 times slower than traversing the 10-20,000 documents of a smaller DocSet and counting the frequencies of their Terms.

      This is not applicable if the size of the DocSet is close to the total number of unique tokens (200,000 in our scenario).
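      A minimal sketch of the kind of size-based cutoff this implies (the class name, helper name, and the 10x ratio are illustrative assumptions, not from this issue):

      import org.apache.solr.search.DocSet;

      class FacetStrategySketch {
        // Hypothetical heuristic: walk per-document term vectors only when the
        // DocSet is much smaller than the field's unique-term count.
        static boolean preferTermVectorCounting(DocSet docs, int numUniqueTerms) {
          return (long) docs.size() * 10 < numUniqueTerms;
        }
      }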

      See SimpleFacets.java:

      public NamedList getFacetTermEnumCounts(
        SolrIndexSearcher searcher,
        DocSet docs,
        String field,
        int offset,
        int limit,
        int mincount,
        boolean missing,
        boolean sort,
        String prefix)
      throws IOException {...}
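
      Below is a minimal sketch of the per-document counting strategy described above; it is not the committed patch. The class and method names are hypothetical, and it assumes the faceted field is indexed with termVectors="true" so that Lucene's TermFreqVector API can be used; sorting, offset/limit, and mincount handling are omitted.

      import java.io.IOException;
      import java.util.HashMap;
      import java.util.Map;

      import org.apache.lucene.index.IndexReader;
      import org.apache.lucene.index.TermFreqVector;
      import org.apache.solr.common.util.NamedList;
      import org.apache.solr.search.DocIterator;
      import org.apache.solr.search.DocSet;
      import org.apache.solr.search.SolrIndexSearcher;

      public class TermVectorFacetSketch {

        // Count, for each term in the faceted field, how many documents of the
        // DocSet contain it, by walking each document's term vector instead of
        // enumerating all ~200,000 terms and intersecting each with the DocSet.
        public NamedList getFacetTermVectorCounts(SolrIndexSearcher searcher,
                                                  DocSet docs,
                                                  String field) throws IOException {
          IndexReader reader = searcher.getReader();
          Map<String, Integer> counts = new HashMap<String, Integer>();

          DocIterator iter = docs.iterator();
          while (iter.hasNext()) {
            int docId = iter.nextDoc();
            TermFreqVector tfv = reader.getTermFreqVector(docId, field);
            if (tfv == null) continue;           // no term vector stored for this doc
            for (String term : tfv.getTerms()) { // each unique term in this document
              Integer c = counts.get(term);
              counts.put(term, c == null ? 1 : c + 1);
            }
          }

          // Raw per-term document counts, in the same NamedList shape that
          // getFacetTermEnumCounts returns.
          NamedList res = new NamedList();
          for (Map.Entry<String, Integer> e : counts.entrySet()) {
            res.add(e.getKey(), e.getValue());
          }
          return res;
        }
      }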
      

        Activity

        Fuad Efendi created issue -
        Fuad Efendi made changes -
        Field: Description (code snippet markup reformatted; content unchanged)
        Fuad Efendi made changes -
        Comment [ trivial formatting ]
        Fuad Efendi made changes -
        Field: Description (code block markup simplified; content unchanged)
        Fuad Efendi made changes -
        Status Open [ 1 ] Closed [ 6 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee: Unassigned
          • Reporter: Fuad Efendi
          • Votes: 0
          • Watchers: 2


              Time Tracking

              • Original Estimate: 1,680h
              • Remaining Estimate: 1,680h
              • Time Spent: Not Specified
