[LUCENE-5558] Add TruncateTokenFilter - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 4.7
Fix Version/s: 4.8, 6.0
Component/s: modules/analysis
Labels:
- Turkish
- f5

Lucene Fields: New, Patch Available

Description

I am using this filter as a stemmer for the Turkish language. It is used in many academic studies (classification, retrieval), where it is called the Fixed Prefix Stemmer, the Simple Truncation Method, or F5 for short.

Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish in "Information Retrieval on Turkish Texts". The same study is the source of most of the entries in stopwords_tr.txt.
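As a rough illustration of what fixed-prefix stemming does, here is a minimal plain-Java sketch. It is not the code from the attached patch; the class name `FixedPrefixStemmer` and the choice to count code points rather than chars are assumptions made only for this example.

```java
public final class FixedPrefixStemmer {
    private FixedPrefixStemmer() {}

    /** Keeps at most prefixLength code points of the term; shorter terms pass through unchanged. */
    public static String truncate(String term, int prefixLength) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be >= 1");
        }
        int codePoints = term.codePointCount(0, term.length());
        if (codePoints <= prefixLength) {
            return term;
        }
        // Convert the code point count into a char offset so surrogate pairs are never split.
        int end = term.offsetByCodePoints(0, prefixLength);
        return term.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(truncate("kitaplar", 5)); // prints "kitap"
        System.out.println(truncate("ev", 5));       // prints "ev"
    }
}
```

With length=5 this corresponds to F5; passing 3 through 7 would give the F3 to F7 variants mentioned above.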

ElasticSearch has a truncate filter, but it does not respect the keyword attribute. The filter also has a use case similar to that of TruncateFieldUpdateProcessorFactory.

The main advantage of F5 stemming is that it is not affected by the meaning loss caused by ASCII folding. It is a diacritics-insensitive stemmer and works well with ASCII folding. See "Effects of diacritics on Turkish information retrieval".
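To make the "diacritics-insensitive" point concrete, the sketch below folds a term and then truncates it to five characters, so that a word typed with or without Turkish diacritics ends up with the same stem. The `fold` method is only a rough stand-in for ASCII folding (it strips combining marks and maps dotless ı to i); it is an assumption for illustration and does not reproduce Lucene's ASCIIFoldingFilter. The class name `FoldedF5` is likewise hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

public final class FoldedF5 {
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    private FoldedF5() {}

    /** Rough stand-in for ASCII folding: decompose, drop combining marks, map dotless i (U+0131) to i. */
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").replace('\u0131', 'i');
    }

    /** Turkish-aware lowercasing, then folding, then truncation to 5 chars (F5). */
    static String stem(String term) {
        String folded = fold(term.toLowerCase(TURKISH));
        return folded.length() <= 5 ? folded : folded.substring(0, 5);
    }

    public static void main(String[] args) {
        // "\u00e7i\u00e7ekler" is "çiçekler" written with escapes to avoid source-encoding issues.
        System.out.println(stem("\u00e7i\u00e7ekler")); // prints "cicek"
        System.out.println(stem("cicekler"));           // prints "cicek"
    }
}
```

Because truncation needs no dictionary or suffix rules, it behaves the same on folded and unfolded input, which is the property the description relies on.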

Here is the full field type I use for "diacritics-insensitive search" for Turkish:

 <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.ApostropheFilterFactory"/>
     <filter class="solr.TurkishLowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.KeywordRepeatFilterFactory"/>
     <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
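The last three filters in this chain work together: KeywordRepeatFilter emits each token twice (one copy marked as a keyword), a truncate filter that respects the keyword attribute shortens only the unmarked copy, and RemoveDuplicatesTokenFilter drops the second copy when truncation left it unchanged. The plain-Java sketch below mimics that outcome for a single token. It does not use the Lucene API, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class KeywordRepeatSketch {
    private KeywordRepeatSketch() {}

    /** Returns the terms that would share one position: the original and, if different, its prefix. */
    static List<String> expand(String token, int prefixLength) {
        // A LinkedHashSet keeps insertion order and silently drops the duplicate,
        // standing in for the remove-duplicates step.
        Set<String> samePosition = new LinkedHashSet<>();
        samePosition.add(token); // keyword copy, left untouched
        samePosition.add(token.length() <= prefixLength ? token : token.substring(0, prefixLength));
        return new ArrayList<>(samePosition);
    }

    public static void main(String[] args) {
        System.out.println(expand("kitaplar", 5)); // prints [kitaplar, kitap]
        System.out.println(expand("ev", 5));       // prints [ev]
    }
}
```

Indexing both forms lets exact matches on the full word coexist with matches on the five-character stem, which is why the keyword attribute question below matters.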

I would like to get the community's opinion on the following:

1) Is there any interest in this?
2) Should the keyword attribute be respected?
3) Package name: analysis.misc versus analysis.tr?
4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter?

Attachments

- LUCENE-5558.patch (28/Mar/14 01:06, 11 kB, Ahmet Arslan)
- LUCENE-5558.patch (28/Mar/14 16:58, 12 kB, Ahmet Arslan)
- LUCENE-5558.patch (28/Mar/14 17:17, 13 kB, Ahmet Arslan)
- LUCENE-5558.patch (31/Mar/14 00:52, 12 kB, Ahmet Arslan)

People

Assignee: Robert Muir

Reporter: Ahmet Arslan

Votes: 0

Watchers: 2

Dates

Created: 28/Mar/14 01:00

Updated: 28/Aug/22 14:03

Resolved: 01/Apr/14 04:31