
Key: SOLR-711
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Unassigned
Reporter: Fuad Efendi
Votes: 0
Watchers: 2

Solr

SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors

Created: 19/Aug/08 07:55 PM   Updated: 17/Dec/08 04:48 PM
Component/s: search
Affects Version/s: 1.3
Fix Version/s: 1.4

Time Tracking:
Original Estimate: 1680h
Remaining Estimate: 1680h
Time Spent: Not Specified

Resolution Date: 17/Dec/08 04:48 PM


Description
From http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html:

Scenario:

  • 10,000,000 documents in the index;
  • 5-10 terms per document;
  • 200,000 unique terms for a tokenized field.

Calculating the sizes of 200,000 intersections against the FilterCache (one per unique term) is roughly 100 times slower than traversing the 10 - 20,000 documents of a smaller DocSet and counting the frequencies of their Terms directly.

This approach is not applicable if the size of the DocSet is close to the total number of unique tokens (200,000 in our scenario).
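
To make the two strategies concrete, here is a minimal sketch, not the Solr implementation. It assumes plain Java collections as stand-ins: a hypothetical `termToDocs` map plays the role of per-term filters, a hypothetical `docToTerms` map plays the role of term vectors, and a `BitSet` plays the role of the DocSet. The class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: plain collections stand in for the index,
// the per-term filters and the term vectors. None of these are Solr APIs.
class FacetCountSketch {

    // Strategy 1: one intersection per unique term. The work grows with the
    // number of unique terms (200,000 in the scenario), however small docSet is.
    static Map<String, Integer> countByTermIntersection(
            Map<String, BitSet> termToDocs, BitSet docSet) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, BitSet> e : termToDocs.entrySet()) {
            BitSet intersection = (BitSet) e.getValue().clone();
            intersection.and(docSet);
            int n = intersection.cardinality();
            if (n > 0) {
                counts.put(e.getKey(), n);
            }
        }
        return counts;
    }

    // Strategy 2: visit only the documents in docSet and tally their terms.
    // Assumes each document lists every term at most once.
    static Map<String, Integer> countByDocTraversal(
            Map<Integer, List<String>> docToTerms, BitSet docSet) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc = docSet.nextSetBit(0); doc >= 0; doc = docSet.nextSetBit(doc + 1)) {
            List<String> terms = docToTerms.getOrDefault(doc, Collections.<String>emptyList());
            for (String term : terms) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Builds the term -> documents view from the document -> terms view.
    static Map<String, BitSet> invert(Map<Integer, List<String>> docToTerms) {
        Map<String, BitSet> termToDocs = new HashMap<>();
        for (Map.Entry<Integer, List<String>> e : docToTerms.entrySet()) {
            for (String term : e.getValue()) {
                termToDocs.computeIfAbsent(term, t -> new BitSet()).set(e.getKey());
            }
        }
        return termToDocs;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> docToTerms = new HashMap<>();
        docToTerms.put(0, Arrays.asList("solr", "facet"));
        docToTerms.put(1, Arrays.asList("solr", "lucene"));
        docToTerms.put(2, Arrays.asList("facet", "cache"));
        docToTerms.put(3, Arrays.asList("lucene"));

        BitSet docSet = new BitSet();
        docSet.set(0);
        docSet.set(1);

        // Both strategies print {facet=1, lucene=1, solr=2}
        System.out.println(countByTermIntersection(invert(docToTerms), docSet));
        System.out.println(countByDocTraversal(docToTerms, docSet));
    }
}
```

Both methods return the same counts; they differ in what they iterate over. The first loops over every unique term, the second only over the documents in the DocSet, which is why the second is attractive when the DocSet is much smaller than the number of unique terms.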

See SimpleFacets.java:

public NamedList getFacetTermEnumCounts(
  SolrIndexSearcher searcher, 
  DocSet docs, ...


Fuad Efendi made changes - 19/Aug/08 08:01 PM
Field Original Value New Value
Description From [url]http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html[/url]:

Scenario:
- 10,000,000 documents in the index;
- 5-10 terms per document;
- 200,000 unique terms for a tokenized field.

_Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._

Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario).

See SimpleFacets:
 {{
public NamedList getFacetTermEnumCounts(
  SolrIndexSearcher searcher,
  DocSet docs,
  String field,
  int offset,
  int limit,
  int mincount,
  boolean missing,
  boolean sort,
  String prefix)
throws IOException {...}
}}


From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]:

Scenario:
- 10,000,000 documents in the index;
- 5-10 terms per document;
- 200,000 unique terms for a tokenized field.

_Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._

Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario).

See SimpleFacets:
{code:title=SimpleFacets.java|borderStyle=solid}
public NamedList getFacetTermEnumCounts(
  SolrIndexSearcher searcher,
  DocSet docs, ...
{code}


Fuad Efendi made changes - 19/Aug/08 08:01 PM
Comment [ trivial formatting ]
Fuad Efendi made changes - 19/Aug/08 08:02 PM
Description [ text unchanged; the code block heading became "See SimpleFacets.java:" and the {code} title and borderStyle attributes were dropped ]

Fuad Efendi made changes - 17/Dec/08 04:48 PM
Status Open [ 1 ] Closed [ 6 ]
Resolution Fixed [ 1 ]