
Key: SOLR-711
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Unassigned
Reporter: Fuad Efendi
Votes: 0
Watchers: 2
Solr

SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors

Created: 19/Aug/08 07:55 PM   Updated: 17/Dec/08 04:48 PM
Component/s: search
Affects Version/s: 1.3
Fix Version/s: 1.4

Time Tracking:
Original Estimate: 1680h
Remaining Estimate: 1680h
Time Spent: Not Specified

Resolution Date: 17/Dec/08 04:48 PM


Description
From http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html:

Scenario:

  • 10,000,000 documents in the index;
  • 5-10 terms per document;
  • 200,000 unique terms for a tokenized field.

For smaller DocSets, calculating the sizes of 200,000 intersections with the FilterCache is obviously 100 times slower than traversing the 10 - 20,000 documents in the DocSet and counting the frequencies of their terms.

This does not apply if the size of the DocSet is close to the total number of unique tokens (200,000 in our scenario).

See SimpleFacets.java:

public NamedList getFacetTermEnumCounts(
  SolrIndexSearcher searcher, 
  DocSet docs, ...
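
The two counting strategies compared in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the SimpleFacets code: the class FacetCountSketch and both method names are hypothetical, plain java.util maps stand in for the FilterCache (term to document ids) and for term vectors (document id to terms), and no Solr or Lucene API is used.

```java
import java.util.*;

// Hypothetical illustration only; none of these names exist in Solr.
public class FacetCountSketch {

    // Strategy A: one intersection per unique term, so the work grows with
    // the number of unique terms (200,000 in the scenario above).
    // postings: term -> ids of documents containing it (stand-in for cached filters).
    static Map<String, Integer> countByIntersection(Map<String, Set<Integer>> postings,
                                                    Set<Integer> docSet) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            int n = 0;
            for (Integer doc : e.getValue()) {
                if (docSet.contains(doc)) {
                    n++;
                }
            }
            if (n > 0) {
                counts.put(e.getKey(), n);
            }
        }
        return counts;
    }

    // Strategy B: visit only the documents in docSet and tally their terms,
    // so the work grows with the DocSet size times the terms per document.
    // termVectors: document id -> terms of the field (stand-in for term vectors).
    static Map<String, Integer> countByTraversal(Map<Integer, List<String>> termVectors,
                                                 Set<Integer> docSet) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Integer doc : docSet) {
            List<String> terms = termVectors.getOrDefault(doc, Collections.emptyList());
            // De-duplicate so a term repeated within one document counts once.
            for (String term : new HashSet<>(terms)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> termVectors = new HashMap<>();
        termVectors.put(0, Arrays.asList("solr", "facet"));
        termVectors.put(1, Arrays.asList("solr", "lucene"));
        termVectors.put(2, Arrays.asList("facet", "cache"));

        // Invert the per-document terms to obtain the per-term document sets.
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (Map.Entry<Integer, List<String>> e : termVectors.entrySet()) {
            for (String term : e.getValue()) {
                postings.computeIfAbsent(term, k -> new HashSet<>()).add(e.getKey());
            }
        }

        Set<Integer> docSet = new HashSet<>(Arrays.asList(0, 1));
        System.out.println(countByIntersection(postings, docSet));
        System.out.println(countByTraversal(termVectors, docSet));
    }
}
```

Both methods return the same counts; they differ only in what the loop iterates over, which is why the second is attractive when the DocSet is much smaller than the number of unique terms.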


Comments
Toby Cole added a comment - 20/Aug/08 10:18 AM
We've seen this problem with our dataset: we have around 10m small records and were trying to facet on several multi-valued strings, two of which had over 40k unique values (around 10 values per record).
If we can come up with a plan I don't mind volunteering to implement it.

Shalin Shekhar Mangar added a comment - 20/Sep/08 03:27 PM
What would be a good criterion for switching between the current and the proposed strategy?

Fuad, did you run any tests to find a magic ratio between unique tokens and DocSet size?


Shalin Shekhar Mangar added a comment - 17/Dec/08 12:18 PM
With the new performance changes in faceting with SOLR-475, is this issue still relevant?

Fuad Efendi added a comment - 17/Dec/08 04:48 PM
Thanks, Shalin, for pointing to SOLR-475, which is a very advanced solution to the term-counting approach.