[SOLR-822] CharFilter - normalize characters before tokenizer - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.3
Fix Version/s: 1.4
Component/s: Schema and Analysis
Labels: None

Description

A new plugin, <charFilter/>, which can be placed in front of <tokenizer/> to normalize characters before tokenization.

<fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt" />
    <tokenizer class="solr.MappingCJKTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Multiple <charFilter/> elements can be specified; they are applied as a chain. I'll post a JPEG file showing a character normalization sample soon.
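
As a rough illustration of chaining, a field type might declare two <charFilter/> elements ahead of the tokenizer. This is only a sketch, not taken from the patch: the field type name textCharNormChained and the second mapping file mapping_extra.txt are hypothetical, and the assumption is that each filter receives the output of the one declared before it.

```xml
<!-- Sketch only: the field type name and mapping_extra.txt are hypothetical. -->
<fieldType name="textCharNormChained" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <!-- Assumed to run first: mapping file from the description above -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt" />
    <!-- Assumed to run second, on the output of the first filter -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_extra.txt" />
    <!-- The tokenizer sees only the normalized character stream -->
    <tokenizer class="solr.MappingCJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Splitting the mappings across two files would keep language-specific rules separate from general ones, at the cost of a second pass over the character stream.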

MOTIVATION:
In Japan, two types of tokenizers are in common use: N-gram (CJKTokenizer) and morphological analyzers.
A morphological analyzer relies on a Japanese dictionary to detect terms,
so characters need to be normalized before tokenization.

I'll post a patch soon, too.

Attachments

File | Date | Size | Author
character-normalization.JPG | 22/Oct/08 10:21 | 30 kB | Koji Sekiguchi
SOLR-822.patch | 22/Oct/08 10:32 | 52 kB | Koji Sekiguchi
SOLR-822.patch | 23/Oct/08 14:31 | 52 kB | Koji Sekiguchi
sample_mapping_ja.txt | 23/Oct/08 15:03 | 1 kB | Koji Sekiguchi
SOLR-822.patch | 29/Oct/08 09:04 | 48 kB | Koji Sekiguchi
SOLR-822.patch | 10/Nov/08 17:37 | 57 kB | Koji Sekiguchi
sample_mapping_ja.txt | 10/Nov/08 17:47 | 2 kB | Koji Sekiguchi
SOLR-822.patch | 11/Nov/08 16:31 | 60 kB | Koji Sekiguchi
SOLR-822-for-1.3.patch | 19/Nov/08 08:28 | 60 kB | Koji Sekiguchi
SOLR-822-renameMethod.patch | 19/Mar/09 07:06 | 7 kB | Koji Sekiguchi
japanese-h-to-k-mapping.txt | 21/May/09 17:37 | 3 kB | Mark Bennett

People

Assignee: Koji Sekiguchi

Reporter: Koji Sekiguchi

Votes: 0

Watchers: 2

Dates

Created: 22/Oct/08 10:18

Updated: 10/Jan/14 14:26

Resolved: 19/Mar/09 11:52