  1. Lucene - Core
  2. LUCENE-1343

A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The ISOLatin1AccentFilter takes Unicode characters that carry diacritical marks and replaces them with the same character minus the mark; for example, é becomes e. However, an equally valid way to represent an accented character in Unicode is the unaccented character followed by a non-spacing modifier character (for example, e followed by U+0301 COMBINING ACUTE ACCENT). The ISOLatin1AccentFilter does not handle the accents in decomposed Unicode characters at all. Additionally, some words contain what looks like an accented character but is actually considered a separate, unaccented character, such as Ł, which you may nevertheless want to fold onto its Latin-1 lookalike, L, to make searching easier.

      The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed or decomposed characters. It can also handle the cases described above, where characters that look like they have diacritics (but don't) are folded onto the letter they resemble ( Ł -> L ).
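
      The folding behaviour described above can be illustrated with a minimal sketch built only on the JDK. It is not the code from the attached patch: the class name FoldingSketch, the method fold, and the two-entry lookalike mapping are hypothetical names chosen for illustration. The sketch decomposes the input to NFD with java.text.Normalizer, drops non-spacing marks, and then maps Ł/ł by hand, since those characters have no canonical decomposition.

```java
import java.text.Normalizer;

public class FoldingSketch {

    // Hypothetical helper: folds accented text to unaccented lookalikes.
    // Assumption: only Ł/ł are mapped by hand here; a real filter would
    // need a much larger table of lookalike characters.
    static String fold(String input) {
        // NFD splits composed characters such as U+00E9 into a base letter
        // followed by a combining mark (e + U+0301). Input that is already
        // decomposed passes through unchanged.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);

        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);

            // Drop non-spacing marks (Unicode general category Mn).
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                continue;
            }

            // Characters that look accented but do not decompose.
            switch (cp) {
                case 0x0141: out.append('L'); break; // Ł
                case 0x0142: out.append('l'); break; // ł
                default: out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00E9"));             // composed é
        System.out.println(fold("e\u0301"));            // decomposed é
        System.out.println(fold("\u0141\u00F3d\u017A")); // Łódź
    }
}
```

      Both the composed and the decomposed spelling of é fold to the same plain e, and Łódź folds to Lodz, which is the property the issue asks for. Inside Lucene the same logic would sit in a TokenFilter applied to each term rather than in a static helper.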

        Attachments

        1. UnicodeNormalizationFilterFactory.java
          9 kB
          Robert Haschart
        2. UnicodeNormalizationFilter.java
          4 kB
          Robert Haschart
        3. UnicodeCharUtil.java
          25 kB
          Robert Haschart
        4. normalizer.jar
          390 kB
          Robert Haschart
        5. LUCENE-1343.patch
          176 kB
          Robert Muir
        6. utr30.nrm
          41 kB
          Robert Muir
        7. LUCENE-1343.patch
          183 kB
          Robert Muir
        8. utr30.nrm
          41 kB
          Robert Muir

              People

              • Assignee:
                rcmuir Robert Muir
                Reporter:
                haschart Robert Haschart
              • Votes:
                1
                Watchers:
                5