  1. Lucene - Core
  2. LUCENE-5558

Add TruncateTokenFilter


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.7
    • Fix Version/s: 4.8, 6.0
    • Component/s: modules/analysis
    • Lucene Fields: New, Patch Available

    Description

      I am using this filter as a stemmer for Turkish. It is used in many academic studies (classification, retrieval), where it is called the Fixed Prefix Stemmer, the Simple Truncation Method, or F5 for short.

      Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish in "Information Retrieval on Turkish Texts". This is the same work from which most of stopwords_tr.txt was taken.
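
As a rough illustration of the fixed-prefix idea described above, the following plain-Java sketch keeps at most the first N characters of a term. The class and method names are hypothetical and are not taken from the attached patch; an actual filter would work on the token stream's term text rather than on plain strings.

```java
/** Hypothetical helper illustrating fixed-prefix (F-N) stemming; not the patch's code. */
final class FixedPrefixStemmer {
    private final int prefixLength;

    FixedPrefixStemmer(int prefixLength) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be >= 1, got " + prefixLength);
        }
        this.prefixLength = prefixLength;
    }

    /**
     * Returns the term unchanged if it has at most prefixLength chars,
     * otherwise its first prefixLength chars (counted as UTF-16 units).
     */
    String stem(String term) {
        return term.length() <= prefixLength ? term : term.substring(0, prefixLength);
    }
}
```

With prefixLength set to 5 (F5), kitaplar and kitapta both reduce to kitap, while a short term such as ev is left as is.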

      ElasticSearch has a truncate filter, but it does not respect the keyword attribute. It also has a use case similar to TruncateFieldUpdateProcessorFactory.

      The main advantage of F5 stemming is that it is not affected by the meaning loss caused by ASCII folding. It is a diacritics-insensitive stemmer and works well with ASCII folding; see "Effects of diacritics on Turkish information retrieval".
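
To make the diacritics-insensitivity point concrete, here is a small sketch that lowercases with Turkish rules, folds diacritics, and then truncates to five characters. The fold method is only an approximation of ASCII folding, assumed here for illustration (it strips combining marks and maps dotless ı to i); Lucene's ASCIIFoldingFilter covers far more characters. All names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical sketch: Turkish lowercasing, approximate ASCII folding, then F5 truncation. */
final class FoldThenTruncate {
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    /** Rough stand-in for ASCII folding: strips combining marks and maps dotless i (U+0131) to i. */
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").replace('\u0131', 'i');
    }

    /** Lowercases with Turkish rules, folds, then keeps at most the first five chars. */
    static String f5(String term) {
        String folded = fold(term.toLowerCase(TURKISH));
        return folded.length() <= 5 ? folded : folded.substring(0, 5);
    }
}
```

Under these assumptions, çiçekler and its diacritics-free spelling cicekler both come out as cicek, so a query typed without Turkish characters still matches.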

      Here is the full field type I use for "diacritics-insensitive search" for Turkish:

       <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
         <analyzer>
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.ApostropheFilterFactory"/>
           <filter class="solr.TurkishLowerCaseFilterFactory"/>
           <filter class="solr.ASCIIFoldingFilterFactory"/>
           <filter class="solr.KeywordRepeatFilterFactory"/>
           <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
         </analyzer>
       </fieldType>

      I would like to get community opinions:

      1) Is there any interest in this?
      2) Should the keyword attribute be respected?
      3) Package name: analysis.misc versus analysis.tr
      4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter
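
Question 2 matters for the field type shown earlier. KeywordRepeatFilter emits every token twice, once flagged as a keyword, and RemoveDuplicatesTokenFilter drops a token whose text repeats at the same position. The plain-Java sketch below imitates the combined effect for a single term, assuming the truncation step skips keyword-flagged tokens. It is a simplified model with hypothetical names, not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of keyword repeat + keyword-aware truncation + duplicate removal. */
final class KeywordAwareTruncation {
    /**
     * For one input term, returns the terms that end up at that position:
     * the original (keyword copy, left untouched) and the truncated copy,
     * with the copy dropped when it is identical to the original.
     */
    static List<String> expand(String term, int prefixLength) {
        List<String> out = new ArrayList<>();
        out.add(term); // keyword copy: the truncation step is assumed to skip it
        String truncated = term.length() <= prefixLength ? term : term.substring(0, prefixLength);
        if (!truncated.equals(term)) { // mimics duplicate removal at the same position
            out.add(truncated);
        }
        return out;
    }
}
```

In this model a long term is indexed both in full and as its five-character prefix, while a term of five characters or fewer is indexed once. If truncation ignored the keyword flag, both copies would be truncated and the full form would be lost.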

      Attachments

        1. LUCENE-5558.patch
          12 kB
          Ahmet Arslan
        2. LUCENE-5558.patch
          13 kB
          Ahmet Arslan
        3. LUCENE-5558.patch
          12 kB
          Ahmet Arslan
        4. LUCENE-5558.patch
          11 kB
          Ahmet Arslan


          People

            Assignee: Robert Muir (rcmuir)
            Reporter: Ahmet Arslan (iorixxx)
            Votes: 0
            Watchers: 2
