Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      ApostropheTokenizer emits extra tokens during analysis for fields whose values contain apostrophes. The goal is to ensure that documents differing only by an apostrophe receive the same relevancy score.

      For example, if a document contains the string "McDonald's", it is tokenized as "McDonald's McDonalds". A search for either "McDonald's" or "McDonalds" then produces a similar score.

      This code handles up to two apostrophes in a token.
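
      The expansion described above can be sketched roughly as follows. This is not the attached patch, only a minimal Python illustration of the idea. The function names expand_apostrophe_token and index_tokens are hypothetical, whitespace splitting stands in for the real tokenizer, and leaving tokens with more than two apostrophes untouched is an assumption, since the issue only says that up to two are handled.

```python
from typing import List


def expand_apostrophe_token(token: str, max_apostrophes: int = 2) -> List[str]:
    """Return the token, plus an apostrophe-free variant when applicable.

    Assumption: tokens with more than `max_apostrophes` apostrophes are
    passed through unchanged; the issue does not say what happens to them.
    """
    count = token.count("'")
    if count == 0 or count > max_apostrophes:
        return [token]
    # Keep the original and add a variant with every apostrophe removed.
    return [token, token.replace("'", "")]


def index_tokens(text: str) -> List[str]:
    """Split on whitespace (a stand-in for a real tokenizer) and expand."""
    tokens: List[str] = []
    for raw in text.split():
        tokens.extend(expand_apostrophe_token(raw))
    return tokens


if __name__ == "__main__":
    print(index_tokens("McDonald's fries"))
```

      Because both forms end up in the index, a query for either spelling can match the same document, which is the behavior the description is after.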

      To use this tokenizer, add the following filter to the index analyzer in schema.xml:

      <analyzer type="index">
      <filter class="org.apache.lucene.analysis.ApostropheTokenFactory"/>
      ...
      </analyzer>

        Activity

        Uwe Schindler added a comment -

        Move issue to Solr 4.9.

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Hoss Man added a comment -

        Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

        email notification suppressed to prevent mass-spam
        pseudo-unique token identifying these issues: hoss20120321nofix36

        Mauro Asprea added a comment (edited) -

        I confirm this is working using the WordDelimiterFilterFactory like Robert said:

        <filter class="solr.WordDelimiterFilterFactory"
                stemEnglishPossessive="0"
                preserveOriginal="1"
                catenateAll="1"/>

        Then, using the Solr Admin Analysis page, I get the following:
        Value: McDonald's

        Indexed Term
        McDonald's
        Mc
        Donald
        s
        McDonalds

        One thing: you have to be sure that no earlier filter removes the trailing "'s". In my case I had the StandardFilterFactory, which does remove trailing apostrophes.

        Robert Muir added a comment -

        3.4 -> 3.5

        Robert Muir added a comment -

        Bulk move 3.2 -> 3.3

        Hoss Man added a comment -

        Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

        http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

        Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

        A unique token for finding these 240 issues in the future: hossversioncleanup20100527

        Otis Gospodnetic added a comment -

        Boris, please let us know if WordDelimiterFilter works for you.
        If it does not and this new code is needed, could you please:

        • add the ASL to the top
        • write a bit of javadoc (your description from this issue is good)
        • write a unit test

        Thanks for your help!

        Noble Paul added a comment -

        At this point we are not entertaining new features for 1.4.

        Robert Muir added a comment -

        Sergey, have you looked at SOLR-1266?

        By using the new stemEnglishPossessive=0 option, I think you can get the same behavior with WordDelimiterFilter, if you use preserveOriginal=1 along with catenateWords=1


          People

          • Assignee: Unassigned
          • Reporter: Sergey Borisov
          • Votes: 0
          • Watchers: 1
