Lucene - Core / LUCENE-3748

EnglishPossessiveFilter should work with Unicode right single quotation mark

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1, 3.2, 3.4, 3.5
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The current EnglishPossessiveFilter (used in EnglishAnalyzer) removes possessives only when the apostrophe is the ASCII '\'' character (followed by 's' or 'S'). Some common systems (German?) insert the Unicode '\u2019' (RIGHT SINGLE QUOTATION MARK) instead, and that form is not removed when processing UTF-8 text. I propose changing EnglishPossessiveFilter to support '\u2019' as an alternative to '\''.
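To make the proposal concrete, here is a minimal sketch in plain Java of the token-level check being described. It is not the Lucene source or either attached patch: the class name `PossessiveSketch` and the helpers `isApostrophe` and `stripPossessive` are hypothetical, and the real filter works on a token stream rather than on a `String`.

```java
final class PossessiveSketch {
    // Apostrophe forms accepted by this sketch: ASCII U+0027 and U+2019.
    static boolean isApostrophe(char c) {
        return c == '\'' || c == '\u2019';
    }

    // Drops a trailing apostrophe followed by 's' or 'S' from one token.
    // Tokens without that suffix are returned unchanged.
    static String stripPossessive(String term) {
        int len = term.length();
        if (len >= 2
                && isApostrophe(term.charAt(len - 2))
                && (term.charAt(len - 1) == 's' || term.charAt(len - 1) == 'S')) {
            return term.substring(0, len - 2);
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(stripPossessive("dog's"));      // ASCII apostrophe
        System.out.println(stripPossessive("dog\u2019s")); // U+2019 apostrophe
        System.out.println(stripPossessive("dogs"));       // no possessive suffix
    }
}
```

With only the `'\''` comparison in `isApostrophe`, the second call would return its input untouched, which is the behavior this issue reports.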

      1. LUCENE-3748.patch
        5 kB
        Robert Muir
      2. Patch-Lucene-3748
        3 kB
        David Croley
      3. LucenePatch
        2 kB
        David Croley

        Activity

        David Croley added a comment -

        Patch to address the bug and add a unit test for it.

        Robert Muir added a comment -

        I agree with the patch. We can easily add backwards compat here, no problem.

        As for any potential others, the only possibility from my perspective is U+FF07 FULLWIDTH APOSTROPHE,
        though I could go either way on that (since it's a compatibility character).

        Any other opinions?

        Steve Rowe added a comment -

        +1, and +1 to include U+FF07.

        There are several other characters listed with U+0027 APOSTROPHE in http://www.unicode.org/charts/PDF/U0000.pdf that could be interpreted visually as an English apostrophe, e.g. U+02BC MODIFIER LETTER APOSTROPHE, but it would be unusual for people to use those characters as apostrophes in English text, so I think it would be fine to exclude them. (By contrast, the Unicode standard says that U+2019 is the preferred apostrophe form.)

        Robert Muir added a comment -

        Those are my thoughts exactly, Steven.

        I think by default we should go with U+0027 and U+2019 (and as I mentioned, FF07 or not, it's less important).

        As far as other look-alikes, sure it could happen, BUT the user could just place ASCIIFoldingFilter before
        EnglishPossessiveFilter if they want that more brutal behavior... that's a more lossy normalization that I
        don't think we should do by default...

        David Croley added a comment -

        If you want to preserve backwards compatibility, I guess I could pass matchVersion in from the calling Analyzer, but that crufts it up a bit. Is that necessary?

        Robert Muir added a comment -

        I think we should do it (despite the cruft).

        One of these days we will realize our goal of a stable interface between IndexWriter etc. and analyzers, such
        that if you are really worried about this with old indexes, you just use lucene-analyzers-ancient-version.jar
        and it works with the newer lucene-core.jar.

        But until then, I think we need it (e.g. we add a deprecated ctor for API compatibility that forwards to Version.LUCENE_35)
        and conditionalize the handling based on Version.

        If you don't want to cruft it up, lemme know; otherwise feel free to add a patch.
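The backwards-compatibility approach discussed in this comment (a deprecated constructor that forwards to the old version, plus version-conditional handling) could look roughly like the following. This is a sketch under assumptions, not the committed patch: the `MatchVersion` enum is a hypothetical stand-in for Lucene's `Version` class, the class name is made up, and the exact version at which the new characters are enabled is assumed for illustration.

```java
final class VersionedPossessiveSketch {
    // Hypothetical stand-in for Lucene's Version constants, oldest first.
    enum MatchVersion { V_35, V_36 }

    private final MatchVersion matchVersion;

    // Kept only for API compatibility; forwards to the pre-change behavior.
    @Deprecated
    VersionedPossessiveSketch() {
        this(MatchVersion.V_35);
    }

    VersionedPossessiveSketch(MatchVersion matchVersion) {
        this.matchVersion = matchVersion;
    }

    // Strips a trailing apostrophe + s/S; which apostrophes count depends
    // on the version the caller asked for.
    String strip(String term) {
        int len = term.length();
        if (len < 2) {
            return term;
        }
        char mark = term.charAt(len - 2);
        char last = term.charAt(len - 1);
        boolean newer = matchVersion.compareTo(MatchVersion.V_36) >= 0;
        // Old behavior: ASCII apostrophe only. Newer: also U+2019 and U+FF07.
        boolean apostrophe = mark == '\''
                || (newer && (mark == '\u2019' || mark == '\uFF07'));
        if (apostrophe && (last == 's' || last == 'S')) {
            return term.substring(0, len - 2);
        }
        return term;
    }

    public static void main(String[] args) {
        VersionedPossessiveSketch legacy = new VersionedPossessiveSketch();
        VersionedPossessiveSketch current =
                new VersionedPossessiveSketch(MatchVersion.V_36);
        System.out.println(legacy.strip("dog\u2019s").equals("dog"));  // old rules
        System.out.println(current.strip("dog\u2019s").equals("dog")); // new rules
    }
}
```

The point of the design is that an index built with the old rules keeps analyzing queries the same way as long as the caller keeps passing the old version.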

        David Croley added a comment -

        Newer patch that preserves backwards compatibility. Not sure if I've done that the best way, so feel free to change as needed.

        Walter Underwood added a comment -

        Why make separate patches for characters instead of using Unicode normalization? Converting to NFKC would also solve this for the prime character (U+2032) and any other codepoint that is equivalent.

        Compatibility normalization is designed for precisely this purpose, equivalence ignoring appearance.

        Robert Muir added a comment -

        Walter: U+2019 does not decompose at all (see http://unicode.org/cldr/utility/character.jsp?a=2019&B1=Show).

        This is because it's not a compatibility character of any kind. In fact, it's the single quote (U+0027)
        that's ambiguous; U+2019 is the correct one here.

        From a pedantic point of view, we should be forcing you to disambiguate the very ambiguous single quote (U+0027)
        on your keyboard and ONLY handling U+2019 in this filter, but I realize some people might find this opinion a
        tad extreme.
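The decomposition claim is easy to check with the JDK's own `java.text.Normalizer`. The small program below is an illustration added here, not part of any patch on this issue; the class name `NfkcApostropheCheck` and the helper `nfkc` are hypothetical.

```java
import java.text.Normalizer;

final class NfkcApostropheCheck {
    // Applies Unicode compatibility normalization (NFKC) to the input.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+2019 RIGHT SINGLE QUOTATION MARK, U+FF07 FULLWIDTH APOSTROPHE,
        // U+2032 PRIME: the code points mentioned in this thread.
        String[] samples = { "\u2019", "\uFF07", "\u2032" };
        for (String s : samples) {
            String n = nfkc(s);
            System.out.printf("U+%04X -> U+%04X%n",
                    (int) s.charAt(0), (int) n.charAt(0));
        }
    }
}
```

Going by the Unicode character data, U+2019 and U+2032 should come back unchanged, while U+FF07 should fold to U+0027. So NFKC on its own would cover the fullwidth apostrophe but not the character this issue is about.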

        Robert Muir added a comment -

        updated patch: thanks again David.

        I added some javadocs, CHANGES.txt, an assertion to the solr factory, and (somewhat reluctantly) FF07.

        Robert Muir added a comment -

        Thanks David!


          People

          • Assignee:
            Robert Muir
            Reporter:
            David Croley
          • Votes:
            0
            Watchers:
            1