[LUCENE-4286] Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 4.0-ALPHA, 3.6.1
Fix Version/s: 4.0-BETA, 6.0
Component/s: None
Labels: None

Lucene Fields: New

Description

Add an optional flag to CJKBigramFilter that tells it to output unigrams in addition to bigrams. This would allow both bigrams and unigrams to be indexed. At query time, the analyzer could then analyze queries as bigrams unless the query consists of a single Han character, in which case it produces a unigram.

As an example, here is a configuration for a Solr fieldType in which the index analyzer sets the "indexUnigrams" flag and the query analyzer does not.
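A minimal sketch of such a fieldType is given here. The field type name `text_cjk`, the `solr.TextField` class, and the use of `solr.ICUTokenizerFactory` are assumptions made for illustration. `indexUnigrams` is the flag name proposed in this issue and may not match the attribute name in any committed patch.

```xml
<!-- Sketch only: "text_cjk" and the tokenizer choice are illustrative assumptions. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: emit bigrams plus unigrams via the proposed flag. -->
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" indexUnigrams="true"/>
  </analyzer>
  <!-- Query time: flag omitted, so bigrams only; a lone Han character
       still comes out as a unigram and matches the indexed unigrams. -->
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true"/>
  </analyzer>
</fieldType>
```

With a field type along these lines, a single-character Han query and a multi-character Han query can both be served from the same field.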

Use case: About 10% of our queries that contain Han characters are single-character queries. CJKBigramFilter only outputs single characters when there are no adjacent bigrammable characters in the input. This means we have to create a separate field to index Han unigrams in order to handle single-character queries, and then write application code to search that separate field whenever we detect a single-character Han query. This is rather kludgy. With the optional flag, we could configure Solr as above.

This is somewhat analogous to the flags added in LUCENE-1370 for the ShingleFilter, which allow single-word queries (although that filter uses word n-grams rather than character n-grams).

Attachments

- LUCENE-4286.patch_3.x, 29/Nov/12 18:03, 20 kB, Tom Burton-West
- LUCENE-4286.patch, 04/Aug/12 01:54, 17 kB, Robert Muir
- LUCENE-4286.patch, 04/Aug/12 00:22, 7 kB, Robert Muir

People

Assignee: Unassigned
Reporter: Tom Burton-West
Votes: 0
Watchers: 3

Dates

Created: 03/Aug/12 22:56
Updated: 28/Aug/22 13:23
Resolved: 04/Aug/12 22:42