Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.7
    • Fix Version/s: 4.8, 6.0
    • Component/s: modules/analysis
    • Labels:
    • Lucene Fields: New, Patch Available

      Description

      I use this filter as a stemmer for Turkish. It appears in many academic studies (classification, retrieval), where it is called the Fixed Prefix Stemmer, the Simple Truncation Method, or F5 for short.

      Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish in "Information Retrieval on Turkish Texts". This is the same work from which most of stopwords_tr.txt was taken.
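To make the idea concrete, here is a minimal sketch of fixed-prefix stemming in plain Java. It is not the code from the attached patch: the class name `FixedPrefixStemmer`, the method name `stem`, and the sample words are made up for illustration, and rejecting lengths below 1 is an assumption of the sketch.

```java
// Illustrative sketch only: plain Java, not the Lucene filter from the patch.
final class FixedPrefixStemmer {
    private FixedPrefixStemmer() {}

    /** Returns the first prefixLength characters of term, or term unchanged if it is shorter. */
    static String stem(String term, int prefixLength) {
        // Assumption of this sketch: a prefix length below 1 is treated as a caller error.
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be >= 1, got " + prefixLength);
        }
        return term.length() <= prefixLength ? term : term.substring(0, prefixLength);
    }

    public static void main(String[] args) {
        // Inflected forms of the same word collapse to one F5 stem; short words pass through.
        for (String word : new String[] {"kitaplar", "kitaplardan", "ev"}) {
            System.out.println(word + " -> " + stem(word, 5));
        }
    }
}
```

F3 to F7 differ only in the `prefixLength` argument, so comparing them amounts to changing a single number.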

      ElasticSearch has a truncate filter, but it does not respect the keyword attribute. It also has a use case similar to TruncateFieldUpdateProcessorFactory.

      The main advantage of F5 stemming is that it is not affected by the meaning loss caused by ASCII folding. It is a diacritics-insensitive stemmer and works well with ASCII folding. See "Effects of diacritics on Turkish information retrieval".
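A rough sketch of why truncation and folding combine well, in plain Java. The `fold` method is only a stand-in for ASCII folding (it strips combining marks and maps dotless i by hand), not Lucene's ASCIIFoldingFilter, and the class and method names are hypothetical. Non-ASCII letters are written as Unicode escapes so the source stays ASCII-only.

```java
// Illustrative sketch only: a simplified stand-in for folding followed by F5 truncation.
final class FoldThenTruncate {
    private FoldThenTruncate() {}

    /** Approximates ASCII folding for Turkish letters; far less complete than a real folding filter. */
    static String fold(String term) {
        // Decompose letters such as c-cedilla into a base letter plus a combining mark.
        String decomposed = java.text.Normalizer.normalize(term, java.text.Normalizer.Form.NFD);
        // Drop the combining marks left behind by the decomposition.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // U+0131 (dotless i) has no decomposition, so it is mapped by hand.
        return stripped.replace('\u0131', 'i');
    }

    /** Folds the term, then keeps at most its first five characters. */
    static String foldedF5(String term) {
        String folded = fold(term);
        return folded.length() <= 5 ? folded : folded.substring(0, 5);
    }

    public static void main(String[] args) {
        String withDiacritics = "\u00e7i\u00e7ekler"; // "flowers", spelled with c-cedilla
        String typedInAscii = "cicekler";             // the same word typed without diacritics
        System.out.println(foldedF5(withDiacritics).equals(foldedF5(typedInAscii)));
    }
}
```

Because truncation looks only at character positions, it needs no suffix rules that depend on the original diacritics, which is the property the description relies on.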

      Here is the full field type I use for "diacritics-insensitive search" for Turkish:

       <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
         <analyzer>
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.ApostropheFilterFactory"/>
           <filter class="solr.TurkishLowerCaseFilterFactory"/>
           <filter class="solr.ASCIIFoldingFilterFactory"/>
           <filter class="solr.KeywordRepeatFilterFactory"/>
           <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
         </analyzer>
       </fieldType>

      I would like to get community opinions:

      1) Is there any interest in this?
      2) Should the keyword attribute be respected?
      3) Package name: analysis.misc versus analysis.tr?
      4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter?

      1. LUCENE-5558.patch
        12 kB
        Ahmet Arslan
      2. LUCENE-5558.patch
        13 kB
        Ahmet Arslan
      3. LUCENE-5558.patch
        12 kB
        Ahmet Arslan
      4. LUCENE-5558.patch
        11 kB
        Ahmet Arslan

        Activity

        Ahmet Arslan added a comment -

        initial patch

        Ahmet Arslan added a comment -

        org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings instantiates a TruncateTokenFilter with a prefixLength of -48. This throws:

         java.lang.StringIndexOutOfBoundsException: String index out of range: -48 

        It can be reproduced with:

        ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=B9259B3A46E7F1D6 -Dtests.slow=true -Dtests.locale=da -Dtests.timezone=Asia/Jayapura -Dtests.file.encoding=US-ASCII 
        Robert Muir added a comment -

        In the constructor of TruncateTokenFilter, can you throw an IllegalArgumentException if someone passes a negative value?

        TestRandomChains is basically enforcing that you get an exception in this case when you construct the analysis chain versus when you actually index documents.
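A minimal sketch of that request in plain Java, showing the check moved into the constructor so the failure surfaces when the chain is built. The class name `TruncatingStage` and the message text are made up for illustration and are not taken from the patch.

```java
// Illustrative sketch only: eager argument validation, not the committed Lucene code.
final class TruncatingStage {
    private final int prefixLength;

    TruncatingStage(int prefixLength) {
        // Fail here, at construction time, instead of later inside substring()
        // when the first token is processed.
        if (prefixLength < 0) {
            throw new IllegalArgumentException("prefixLength must not be negative: " + prefixLength);
        }
        this.prefixLength = prefixLength;
    }

    /** Keeps at most the first prefixLength characters of term. */
    String apply(String term) {
        return term.length() <= prefixLength ? term : term.substring(0, prefixLength);
    }

    public static void main(String[] args) {
        try {
            new TruncatingStage(-48);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected at construction: " + e.getMessage());
        }
    }
}
```

Without the check, the same -48 would only fail inside `apply`, where `substring(0, -48)` throws a StringIndexOutOfBoundsException, which matches the failure reported above.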

        Robert Muir added a comment -

        Also this technique is very general for many languages. I think it should be in .misc package instead.

        Ahmet Arslan added a comment -

        truncate throws an exception for negative numbers now. I wonder why other filters (like the length filter) are not affected by negative numbers in their constructors.

        Robert Muir added a comment -

        Providing negative numbers to LengthFilter won't cause any exception at all, so TestRandomChains cannot detect that it is missing some checks...

        Ahmet Arslan added a comment -

        Aha, even if filters are instantiated with negative numbers, TestRandomChains does not detect them unless an exception occurs. Thanks for the explanation. By the way, TestRandomChains is cool.

        Ahmet Arslan added a comment - - edited

        Moved to the miscellaneous package, same as ElasticSearch's truncate.

        Ahmet Arslan added a comment -

        The following declarations do not throw an exception, but no token survives them. It is unusual (and weird) that there are no surviving tokens. What do you think about TestRandomChains detecting an empty token stream at the end?

        Should these filters validate their integer arguments?

         <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="-10" consumeAllTokens="false" />
        
         <filter class="solr.LengthFilterFactory" min="-5" max="-1" />
        
         <filter class="solr.LimitTokenPositionFilterFactory" maxTokenPosition="-3" />
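To illustrate why such declarations silently drop everything, here is a plain-Java sketch of a length check with inclusive bounds, plus a hypothetical up-front validation. Neither method is Lucene code: the inclusive-bounds behaviour and the validation rules are assumptions of the sketch, and the class name is made up.

```java
// Illustrative sketch only: a plain-Java stand-in for a length-based token filter.
final class LengthBoundsSketch {
    private LengthBoundsSketch() {}

    /** Keeps a term only when its length lies in [min, max], both ends inclusive (assumed). */
    static boolean accepts(String term, int min, int max) {
        int length = term.length();
        return length >= min && length <= max;
    }

    /** Counts how many terms survive the given bounds. */
    static int survivors(String[] terms, int min, int max) {
        int count = 0;
        for (String term : terms) {
            if (accepts(term, min, max)) {
                count++;
            }
        }
        return count;
    }

    /** Hypothetical validation; the exact rules are an assumption of this sketch. */
    static void validate(int min, int max) {
        if (min < 0) {
            throw new IllegalArgumentException("min must not be negative: " + min);
        }
        if (max < min) {
            throw new IllegalArgumentException("max must not be smaller than min: " + max + " < " + min);
        }
    }

    public static void main(String[] args) {
        String[] terms = {"a", "evler", "kitaplar"};
        // No string has a length <= -1, so nothing survives and nothing throws.
        System.out.println(survivors(terms, -5, -1));
    }
}
```

Calling `validate(-5, -1)` before building the filter would turn the silent empty stream into an immediate IllegalArgumentException.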
        
        Robert Muir added a comment -

        I think we should add validation for these filters!

        As far as TestRandomChains detecting an empty token stream goes, this is OK: it generates random data, and the chain may correctly remove all the tokens.

        Ahmet Arslan added a comment - - edited

        I think we should add validation for these filters!

        Do you want me to open a ticket titled "Validation for TokenFilters having numeric constructor parameter(s)"?

        Robert Muir added a comment -

        +1

        Ahmet Arslan added a comment -

        Defined the constant key string as static final.

        ASF subversion and git services added a comment -

        Commit 1583525 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1583525 ]

        LUCENE-5558: Add TruncateTokenFilter

        ASF subversion and git services added a comment -

        Commit 1583527 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1583527 ]

        LUCENE-5558: Add TruncateTokenFilter

        Robert Muir added a comment -

        Thanks Ahmet, very nice!

        Uwe Schindler added a comment -

        Close issue after release of 4.8.0


          People

          • Assignee: Robert Muir
          • Reporter: Ahmet Arslan
          • Votes: 0
          • Watchers: 2
