[SOLR-908] Port of Nutch CommonGrams filter to Solr - ASF JIRA

Details

Type: Wish
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4
Component/s: Schema and Analysis
Labels: None

Description

Phrase queries containing common words are extremely slow. We are reluctant to simply use stop words, because removing them causes false hits and makes some phrases impossible to search (for example, "to be or not to be", "the who", or "man in the moon" vs. "man on the moon").

Several postings about slow phrase queries have suggested the approach used by Nutch. Perhaps someone with more Java/Solr experience might take this on.

It should be possible to port the Nutch CommonGrams code to Solr and create a suitable Solr FilterFactory so that it could be used in Solr by listing it in the Solr schema.xml.

"Construct n-grams for frequently occurring terms and phrases while indexing. Optimize phrase queries to use the n-grams. Single terms are still indexed too, with n-grams overlaid."
http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html
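
To make the quoted description concrete, here is a minimal sketch of the overlay idea in plain Java. It is not the Nutch or Solr code. The class name `CommonGramsSketch`, the method `overlay`, the `_` separator, and the rule "form a bigram when either neighbouring word is common" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a hypothetical stand-alone sketch, not the Nutch/Solr filter.
public class CommonGramsSketch {

    /**
     * Returns, for each input position, the terms that would be indexed there:
     * the original word, plus a bigram when the word or its successor is common.
     */
    static List<List<String>> overlay(List<String> words, Set<String> commonWords) {
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            List<String> termsHere = new ArrayList<>();
            String current = words.get(i);
            termsHere.add(current); // the single term is always kept
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                if (commonWords.contains(current) || commonWords.contains(next)) {
                    // Bigram overlaid at the same position; "_" is an assumed separator.
                    termsHere.add(current + "_" + next);
                }
            }
            positions.add(termsHere);
        }
        return positions;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "in", "on"));
        System.out.println(overlay(Arrays.asList("man", "in", "the", "moon"), common));
        System.out.println(overlay(Arrays.asList("man", "on", "the", "moon"), common));
    }
}
```

Under these assumptions the two example phrases differ in the bigrams `man_in` and `man_on`, so a phrase query could match on those rarer terms instead of walking the long postings lists for "in", "on", and "the".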

Attachments

SOLR-908.patch, 23/Jun/09 20:24, 46 kB, Tom Burton-West
SOLR-908.patch, 28/Jul/09 17:14, 47 kB, Tom Burton-West
SOLR-908.patch, 28/Jul/09 21:30, 91 kB, Jason Rutherglen
SOLR-908.patch, 28/Jul/09 21:41, 45 kB, Jason Rutherglen
SOLR-908.patch, 30/Jul/09 22:22, 43 kB, Jason Rutherglen
SOLR-908.patch, 07/Aug/09 19:48, 38 kB, Jason Rutherglen
SOLR-908.patch, 27/Aug/09 17:54, 36 kB, Jason Rutherglen
SOLR-908.patch, 18/Sep/09 21:46, 37 kB, Jason Rutherglen
SOLR-908.patch, 18/Sep/09 23:06, 38 kB, Jason Rutherglen
CommonGramsPort.zip, 03/Apr/09 23:06, 14 kB, Tom Burton-West

Issue Links

depends upon

SOLR-1312 BufferedTokenStream should use new Lucene 2.9 TokenStream API (Closed)

People

Assignee: Yonik Seeley

Reporter: Tom Burton-West

Votes: 3

Watchers: 5

Dates

Created: 11/Dec/08 20:36

Updated: 02/May/13 02:29

Resolved: 22/Sep/09 23:02