Details

    • Lucene Fields:
      New, Patch Available

      Description

      Add a TokenFilter that strips characters after an apostrophe (including the apostrophe itself).

      1. LUCENE-5482.patch
        12 kB
        Ahmet Arslan
      2. LUCENE-5482.patch
        11 kB
        Ahmet Arslan
      3. LUCENE-5482.patch
        15 kB
        Ahmet Arslan

        Activity

        Ahmet Arslan added a comment -

        This is similar to ClassicFilter, which removes 's from the end of words. But ClassicFilter is useful for English only and does not help with Turkish, because it only removes 's and 'S. In Turkish, different character sequences may come after an apostrophe, e.g. 'nin, 'a, 'nin, 'ü etc.

        In Turkish, the apostrophe is used to separate suffixes from proper nouns (names of continents, seas, rivers, lakes, mountains, uplands, and proper names related to religion and mythology). For example, Van Gölü’ne (meaning: to Lake Van).
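
        As a rough illustration of the behavior described above, here is a minimal plain-Java sketch. It is not the patch itself and uses no Lucene classes; the class and method names are hypothetical. Treating the typographic apostrophe (U+2019) the same as the ASCII one is an assumption, suggested by the Van Gölü’ne example.

```java
public class ApostropheSketch {

    /** Returns the token cut off at its first apostrophe, or unchanged if it has none. */
    static String stripAfterApostrophe(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            // Assumption: both the ASCII apostrophe and U+2019 act as separators.
            if (c == '\'' || c == '\u2019') {
                return token.substring(0, i);
            }
        }
        return token; // no apostrophe found
    }

    public static void main(String[] args) {
        System.out.println(stripAfterApostrophe("Ankara'ya"));      // prints "Ankara"
        System.out.println(stripAfterApostrophe("Ankara\u2019ya")); // prints "Ankara"
        System.out.println(stripAfterApostrophe("Van"));            // prints "Van"
    }
}
```

        A real TokenFilter would do the same scan on the term buffer and shorten the term in place instead of allocating a new string.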

        Robert Muir added a comment -

        +1, I saw your paper (very nice) on this and think it would be a great addition to Lucene!

        Ahmet Arslan added a comment -

        This patch adds a new TokenFilter named ApostropheFilter.

        Ahmet Arslan added a comment -

        Thank you for your interest, Robert Muir! Here is the paper in case anyone is interested. It's more of a Solr writeup, though.

        Uwe Schindler added a comment -

        Hi,
        Your patch contains unrelated changes in the analysis module's root folder (it adds an unneeded classpath). Can you fix this?
        Also, because you add new functionality, TurkishAnalyzer should only add the new TokenFilter if matchVersion is at least LUCENE_48.

        Ahmet Arslan added a comment -

        It is possible to achieve the described behavior with either of the following existing filters (without a custom filter). Any thoughts on which way is preferred?

         <filter class="solr.PatternReplaceFilterFactory" pattern="(.*)'(.*)" replacement="$1"/>
        
         <filter class="solr.PatternCaptureGroupFilterFactory" pattern="(.*)'" preserve_original="false" />
        
        Robert Muir added a comment -

        I prefer the explicit filter you have now!

        Uwe Schindler added a comment - - edited

        This should also work:

        <filter class="solr.PatternReplaceFilterFactory" pattern="'(.*)" replacement=""/>
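
        For comparison, the two PatternReplaceFilterFactory patterns suggested in this thread can be tried with plain java.util.regex. This is only a sketch with hypothetical class and method names, and it assumes the filter replaces all matches. The two patterns agree on tokens with a single apostrophe but differ when a token contains more than one, because the greedy first group in (.*)'(.*) keeps everything up to the last apostrophe. Neither pattern matches the typographic apostrophe (U+2019).

```java
import java.util.regex.Pattern;

public class ApostropheRegexSketch {

    // Pattern from the first suggestion: keep group 1, drop the rest.
    private static final Pattern GREEDY_BOTH_SIDES = Pattern.compile("(.*)'(.*)");

    // Pattern from the second suggestion: drop everything from the first apostrophe on.
    private static final Pattern FROM_FIRST_APOSTROPHE = Pattern.compile("'(.*)");

    static String keepFirstGroup(String token) {
        return GREEDY_BOTH_SIDES.matcher(token).replaceAll("$1");
    }

    static String dropFromFirstApostrophe(String token) {
        return FROM_FIRST_APOSTROPHE.matcher(token).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepFirstGroup("Ankara'ya"));          // prints "Ankara"
        System.out.println(dropFromFirstApostrophe("Ankara'ya")); // prints "Ankara"
        // With two apostrophes the results diverge.
        System.out.println(keepFirstGroup("a'b'c"));              // prints "a'b"
        System.out.println(dropFromFirstApostrophe("a'b'c"));     // prints "a"
    }
}
```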
        
        Ahmet Arslan added a comment -

        Thanks for looking into this, Uwe Schindler. I wanted to use QueryParser in TestTurkishAnalyzer.java, but I am not familiar with Ant. I want to include a checkMatch(String text, String qString) method that checks that "this query string" retrieves "this document text".

        I added this, but I am not sure it is correct.

           
        <path id="classpath">
          <path refid="base.classpath"/>
          <pathelement path="${queryparser.jar}"/>
        </path>
        
        Robert Muir added a comment -

        Generally speaking, it's enough to just use assertAnalyzesTo/assertTokenStreamContents in unit tests. It keeps everything simple and easier to debug than integration-like tests.

        That's why we don't depend on queryparser in any of the tests today.

        Uwe Schindler added a comment -

        We should not add an additional dependency on the query parser module! I would remove this test; we generally don't add this type of test. Use BaseTokenStreamTestCase as the base class for your test and use the various assert methods to check that the token stream is what you expect. Feeding IndexWriter with your tokens and executing a search is not really a "unit test" anymore. We have enough tests for indexing.

        Ahmet Arslan added a comment -

        Unneeded classpath change and test case removed.

        Robert Muir added a comment -

        This looks great, Ahmet. As Uwe mentioned, I think the only change we need is the condition in TurkishAnalyzer:

        if (matchVersion.onOrAfter(Version.LUCENE_48)) {
          // do new stuff, include the new filter
        } else {
          // do old stuff
        }
        

        Otherwise, this change looks ready to me.
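
        The version check requested above can be sketched in plain Java as follows. This is a hedged illustration only: the Version enum is a stand-in for Lucene's own constants, and the stage names and their order are hypothetical placeholders, not the actual TurkishAnalyzer chain.

```java
import java.util.ArrayList;
import java.util.List;

public class VersionGateSketch {

    /** Stand-in for Lucene's Version constants; only the ordering matters here. */
    enum Version {
        LUCENE_47, LUCENE_48;

        boolean onOrAfter(Version other) {
            return compareTo(other) >= 0;
        }
    }

    /** Returns illustrative stage names for the analysis chain at the given version. */
    static List<String> chainFor(Version matchVersion) {
        List<String> stages = new ArrayList<>();
        stages.add("tokenizer");
        if (matchVersion.onOrAfter(Version.LUCENE_48)) {
            // New behavior: only indexes created with 4.8 or later get the new filter.
            stages.add("apostrophe");
        }
        stages.add("lowercase");
        stages.add("stem");
        return stages;
    }

    public static void main(String[] args) {
        System.out.println(chainFor(Version.LUCENE_47)); // prints "[tokenizer, lowercase, stem]"
        System.out.println(chainFor(Version.LUCENE_48)); // prints "[tokenizer, apostrophe, lowercase, stem]"
    }
}
```

        Gating on matchVersion keeps analysis output stable for users who pass an older version, so existing indexes continue to match at query time.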

        Robert Muir added a comment -

        Oh, one other thing that would be nice: could you add some javadocs to the public classes?

        The factories typically have an example of their use (see some of the others). For the filter itself, maybe just a simple description of what it does, and a reference to your paper would be good (since you have done experiments and so on).

        Ahmet Arslan added a comment -

        if (matchVersion.onOrAfter(Version.LUCENE_48))

        I tried this, but there is no LUCENE_48 in trunk.

        Robert Muir added a comment -

        That's a bug. I will take care of it right now!

        ASF subversion and git services added a comment -

        Commit 1573059 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1573059 ]

        LUCENE-5482: add missing constant

        ASF subversion and git services added a comment -

        Commit 1573061 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1573061 ]

        LUCENE-5482: remove wrong text from this, its not the latest

        Robert Muir added a comment -

        Thanks for pointing that out, you should see the constant now.

        Ahmet Arslan added a comment -

        Javadoc for public classes added.
        Version.LUCENE_48 check added to TurkishAnalyzer.

        Ahmet Arslan added a comment -

        Should we add this if check to TestTurkishAnalyzer too?

         if(matchVersion.onOrAfter(Version.LUCENE_48))   
         // check apostrophes 
        
        Robert Muir added a comment -

        No, it's OK, because we only instantiate analyzers with the latest version.

        Ahmet Arslan added a comment -

        Great, thanks for the guidance and comments!

        ASF subversion and git services added a comment -

        Commit 1573066 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1573066 ]

        LUCENE-5482: Improve default TurkishAnalyzer

        Uwe Schindler added a comment -

        Cool, thanks!
        +1 to commit

        ASF subversion and git services added a comment -

        Commit 1573074 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1573074 ]

        LUCENE-5482: Improve default TurkishAnalyzer

        Robert Muir added a comment -

        Thanks Ahmet!

        I made one addition: I also inserted this filter into the text_tr chain in the Solr example.

        Uwe Schindler added a comment -

        Close issue after release of 4.8.0


          People

          • Assignee:
            Robert Muir
            Reporter:
            Ahmet Arslan
          • Votes:
            0
            Watchers:
            2
