Lucene - Core
LUCENE-1343

A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed; for example, é becomes e. However, another equally valid way of representing an accented character in Unicode is the unaccented character followed by a non-spacing combining character (like this: é). The ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode characters at all. Additionally, there are some instances where a word contains what looks like an accented character but is actually a separate unaccented character, such as Ł, which, to make searching easier, you may want to fold onto its Latin-1 lookalike L.

      The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed or decomposed characters. It can also handle the cases described above, where characters that look like they have diacritics (but don't) are folded onto the letter they resemble ( Ł -> L ).
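The composed/decomposed distinction, and why Ł needs its own mapping, can be seen with the JDK's own java.text.Normalizer (a minimal illustration, not code from the attached filter):

```java
import java.text.Normalizer;

public class DecomposedDemo {
    public static void main(String[] args) {
        String composed = "\u00E9";  // é as a single precomposed code point
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
        System.out.println(decomposed.length());  // 2: 'e' + U+0301 COMBINING ACUTE ACCENT

        // Stripping combining marks after NFD yields the bare letter.
        System.out.println(decomposed.replaceAll("\\p{M}", ""));  // e

        // Ł (U+0141) has no canonical decomposition, so NFD leaves it untouched;
        // folding it onto L requires an explicit mapping table.
        System.out.println(Normalizer.normalize("\u0141", Normalizer.Form.NFD).equals("\u0141"));  // true
    }
}
```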

      1. UnicodeNormalizationFilterFactory.java
        9 kB
        Robert Haschart
      2. UnicodeNormalizationFilter.java
        4 kB
        Robert Haschart
      3. UnicodeCharUtil.java
        25 kB
        Robert Haschart
      4. normalizer.jar
        390 kB
        Robert Haschart
      5. LUCENE-1343.patch
        176 kB
        Robert Muir
      6. utr30.nrm
        41 kB
        Robert Muir
      7. LUCENE-1343.patch
        183 kB
        Robert Muir
      8. utr30.nrm
        41 kB
        Robert Muir

        Issue Links

          Activity

          Robert Haschart added a comment -

          Source code for UnicodeNormalizationFilter

          Robert Haschart added a comment -

          Java 6 contains a class named java.text.Normalizer that is able to perform Unicode normalization, earlier versions of java do not have that class, and therefore need the code in this jar (which is a subset of the icu4j library) to be able to perform Unicode normalization. The UnicodeNormalizationFilter can work with either the java 6 class java.text.Normalizer or the class com.ibm.icu.text.Normalizer in the jar here.
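The dual-backend idea can be sketched roughly as follows. This is a hypothetical illustration, not the attached implementation: the JDK class is looked up reflectively so that neither backend is a hard compile-time dependency, and the pre-Java-6 ICU fallback is only indicated in a comment (it would need the bundled icu4j subset jar on the classpath):

```java
import java.lang.reflect.Method;

public class NormalizerBackend {

    public static String toNFD(String input) {
        try {
            // Java 6+: java.text.Normalizer.normalize(input, Form.NFD)
            Class<?> cls = Class.forName("java.text.Normalizer");
            Class<?> form = Class.forName("java.text.Normalizer$Form");
            Method m = cls.getMethod("normalize", CharSequence.class, form);
            // Enum constants are public static fields, so getField works here.
            return (String) m.invoke(null, input, form.getField("NFD").get(null));
        } catch (ReflectiveOperationException e) {
            // Pre-Java-6 JVM: would dispatch to com.ibm.icu.text.Normalizer
            // from the bundled jar instead (omitted in this sketch).
            throw new UnsupportedOperationException("no Unicode normalizer available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toNFD("\u00E9").length());  // 2 on any Java 6+ JVM
    }
}
```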

          Hoss Man added a comment -

          Random related comment (just because this issue seemed like a good place to put it)

          People may also want to consider constructing a Filter based on the substitution tables from the perl Text::Unidecode module...

          http://search.cpan.org/~sburke/Text-Unidecode/
          http://interglacial.com/~sburke/tpj/as_html/tpj22.html

          ...i have no idea how it's behavior compares to the UnicodeNormalizationFilter, just that it seems to have similar goals.

          Steve Rowe added a comment -

          Hi Robert,

          My comments below assume you're interested in having this code hosted in the Lucene source repository - please disregard if that's not the case.

          Have you seen the HowToContribute page on the Lucene wiki? It outlines some of the basics concerning code submissions.

          A couple of things I noticed that need to be addressed before the code will be accepted:

          1. Tab characters should be converted to spaces
          2. Indentation increment should be two spaces
          3. Test(s) should be moved from the UnicodeNormalizationFilterFactory.main() method into standalone class(es) that extend LuceneTestCase
          4. More/more explicit javadocs - for example, you should describe the set of provided transformations (e.g. Cyrillic diacritic stripping is included).
          5. Solr is a separate code base, so the UnicodeNormalizationFilterFactory should be moved to a Solr JIRA issue
          6. Because it has a dependency on the ICU jar, this contribution will have to live in the contrib/ area – the Java package names should be adjusted accordingly.
          7. The submission should be repackaged as a patch (instructions available on the above-linked wiki page).
          Lance Norskog added a comment -

          Some languages like Cyrillic have a standard latin-1 transliteration, and deserve their own filters.

          Cyrillic is one case of this. It is based on three alphabets: 1/3 latin, 1/3 greek, and 1/3 new characters for 'ya/ye', 'ts', 'sh', 'ch', 'zh', and 'sh-ch' (fiSH CHips!).

          Unit tests are the best way to document the many ways this thing can work.

          Ken Krugler added a comment -

          Hi Robert,

          FWIW, the issues being discussed here are very similar to those covered by the Unicode Security Considerations technical report #36, and associated data found in the Unicode Security Mechanisms technical report #39.

          The fundamental issue for int'l domain name spoofing is detecting when two sequences of Unicode code points will render as similar glyphs...which is basically the same issue you're trying to address here, so that when you search for something you'll find all terms that "look" similar.

          So for a more complete (though undoubtedly slower & bigger) solution, I'd suggest using ICU4J to do a NFKD normalization, then toss any combining/spacing marks, lower-case the result, and finally apply mappings using the data tables found in the technical report #39 referenced above.

          – Ken
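Ken's suggested pipeline (minus the final TR#39 table mappings, which require ICU data) can be sketched with the JDK alone; this is only an outline of the idea, with the confusables lookup step omitted:

```java
import java.text.Normalizer;
import java.util.Locale;

public class FuzzyFold {
    // NFKD-decompose, drop combining/spacing marks, then lower-case.
    // (The TR#39 data-table mapping step from the suggestion is not shown.)
    static String fold(String s) {
        String nfkd = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return nfkd.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fold("Caf\u00E9"));  // cafe
        System.out.println(fold("\uFB01le"));   // file (NFKD also expands the fi ligature U+FB01)
    }
}
```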

          Erik Hatcher added a comment -

          Unit tests are the best way to document the many ways this thing can work.

          gets a judges score of 11 from me. Gold for Lance for Quote of the Day.

          Robert Haschart added a comment -

          The UnicodeNormalizationFilter does use the decompose normalization
          portion of the icu4j library as a starting point. However, even with
          that, there are several instances where the normalizer code does not
          decompose a character into an unaccented character and an accent mark,
          a notable one being ( Ł -> L ). So the UnicodeNormalizationFilter
          starts with the approach you outlined, performing a decompose
          normalization followed by discarding all non-spacing modifier
          characters, and then goes on from there to further normalize the data
          by folding the additional characters that aren't handled by the
          decompose normalization onto their Latin-1 lookalikes.

          -Robert

          Ken Krugler added a comment -

          Hi Robert,

          So given that you and the Unicode consortium seem to be working on the same problem (normalizing visually similar characters), how similar are your tables to the ones that have been developed to deter spoofing of int'l domain names?

          – Ken

          Mark Miller added a comment -

          Mr Muir, can you take a look at this? Does it offer anything over the ASCIIFoldingFilter? If not, we should close it; if so, what do you recommend?

          Robert Muir added a comment -

          The big picture here and all these other duplicated normalization issues across jira is related to the outdated unicode support in the JDK.

          This issue speaks of removing diacritical marks / NSMs, but the underlying issue is missing Unicode normalization, duplicated here (incorrectly named): LUCENE-1215, and also here: LUCENE-1488 (disclaimer: my impl)

          Speaking for the accent removal: In truth I do not think we should be simply removing NSMs because in most cases, they are there for a reason. For example, they are diacritics in a lot of european languages, but for many eastern languages they are the actual vowels. (i.e. all the indic scripts)

          We need to separate the issue of missing unicode normalization (which is clearly something lucene needs), from the issue of removing diacritics (which is language-specific and doing it based on unicode properties is inappropriate).

          Finally just normalizing unicode in Lucene by itself is not very useful, because there is a careful interaction with other processes and attention needs to be paid to the order in which filters are run. For example, its interaction with case folding can be a bit tricky. If you are interested in this issue I urge you to read the javadocs writeup I placed in the ICUNormalizationFilter in LUCENE-1488.
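The case-folding interaction mentioned above can be made concrete with a small JDK-only illustration (my example, not from any patch here): lower-casing İ (U+0130) introduces a combining mark, so a mark-stripping filter placed before the lower-casing step misses it entirely.

```java
import java.util.Locale;

public class OrderingDemo {
    public static void main(String[] args) {
        String s = "\u0130";  // İ LATIN CAPITAL LETTER I WITH DOT ABOVE (one code point, no separate mark)

        // Strip marks, THEN lower-case: the composed İ carries no combining mark,
        // so nothing is stripped; lower-casing then produces 'i' + U+0307
        // COMBINING DOT ABOVE, and that dot survives into the token.
        String stripThenLower = s.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);

        // Lower-case, THEN strip marks: the freshly introduced U+0307 is removed.
        String lowerThenStrip = s.toLowerCase(Locale.ROOT).replaceAll("\\p{M}", "");

        System.out.println(stripThenLower.length());  // 2 ("i" + U+0307)
        System.out.println(lowerThenStrip);           // i
    }
}
```

The two orders produce different terms for the same input, which is exactly why the order filters run in needs care.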

          Ken Krugler added a comment -

          Just to make sure this point doesn't get lost in the discussion over normalization - the issue of "visual normalization" is one that I think ISOLatin1AccentFilter originally was trying to address. Specifically how to fold together forms of letters that a user, when typing, might consider equivalent.

          This is indeed language specific, and re-implementing support that's already in ICU4J is clearly a Bad Idea.

          I think there's value in a general normalizer that implements the Unicode Consortium's algorithm/data for normalization of int'l domain names, as this is intended to avoid visual spoofing of domain names.

          Don't know/haven't tracked if or when this is going into ICU4J. But (similar to ICU generic sorting) it provides a useful locale-agnostic approach that would work well-enough for most Lucene use cases.

          Robert Muir added a comment -

          Hi Ken, such functionality does exist, although it is new and I think still changing (you are talking about StringPrep/IDN/etc?).

          If a filter for this is desired, we can do it with ICU, though I think it's relatively new (probably not optimized, only works on String, etc.)

          I still think even this is stupid, because unicode encodes characters, not glyphs.

          DM Smith added a comment -

          I also am dubious about a general purpose folding filter that maps letters to their ASCII look-alike and agree that folding is language dependent.

          Many Americans are illiterate when it comes to text with diacritics and NSMs. Personally I'm nearly illiterate. I think having prominent folding filters without adequate explanation of their pitfalls or usefulness may lead illiterates into a false sense of sufficiency.

          If it makes sense to have a filter for TR39, I think that should be a separate issue. If that's what this issue is all about, then its description should be modified.

          I think this should otherwise be closed as a bad idea.

          Robert Muir, Would it make sense to have a Greek filter that strips diacritics? My thought is that if the letter is Greek then the diacritics would be removed, but otherwise it would not.

          Similar question for Hebrew, I see value in two filters: one would strip cantillation and the other, vowel points. Or would it be better to have one that can do both depending on flags?

          Robert Muir added a comment -

          Robert Muir, Would it make sense to have a Greek filter that strips diacritics? My thought is that if the letter is Greek then the diacritics would be removed, but otherwise it would not.

          The GreekLowerCaseFilter (incorrectly named) does this also, somewhat; it removes tone marks... but this might not be what you "want" (depending on what that is) if you are dealing with polytonic Greek (sorry for my ignorance of the biblical text you are looking at, but I think it is ancient Greek?)

          Similar question for Hebrew, I see value in two filters: one would strip cantillation and the other, vowel points. Or would it be better to have one that can do both depending on flags?

          This depends on your use case, and then you have dagesh,shin dot, too... These are all NSMs. But this is going to depend on the user, and I think every person will need their own, they can use CharFilter or other ways of defining these tables.

          DM Smith added a comment -

          Robert Muir, Would it make sense to have a Greek filter that strips diacritics? My thought is that if the letter is Greek then the diacritics would be removed, but otherwise it would not.

          The GreekLowerCaseFilter (incorrectly named) does this also, somewhat. it removes tone marks... but this might not be what you "want" (depending on what that is), if you are dealing with polytonic Greek (sorry for my ignorance of the biblical test you are looking at, but I think it is ancient Greek?)

          Yes, I'm referring to ancient Greek (grc, not el), and they are tone and breathing marks. Most ancient texts did not have these marks, but modern ones do, as do some modern representations of the ancient texts. While I have several semesters of koine Greek under my belt and might be wrong, there may be ambiguities where two words have the same letters but differ on marks; they are infrequent (I don't know of any).

          The GreekLowerCaseFilter appears to only do some of the work and only works on composed characters.

          My question is not whether I'd find the filter useful, but whether it'd be a useful addition to Lucene.

          Similar question for Hebrew, I see value in two filters: one would strip cantillation and the other, vowel points. Or would it be better to have one that can do both depending on flags?

          This depends on your use case, and then you have dagesh,shin dot, too... These are all NSMs.

          I have a terrible habit of not being exact or using the proper terms. Shame on me. I meant that the latter strip all other marks.

          But this is going to depend on the user, and I think every person will need their own, they can use CharFilter or other ways of defining these tables.

          If there is no general purpose contribution, then it should not be part of Lucene and I'll have my own.

          When I do work them up, I'll create an issue or two and attach the results. If they are deemed useful then they can be added to Lucene, otherwise ignored.

          Robert Muir added a comment -

          Yes, I'm referring to ancient Greek (grc, not el) and they are tone and breathing marks. Most ancient texts did not have these marks but modern do. Even some modern representations of the ancient. While I have several semesters of koine Greek under my belt and might be wrong, there may be ambiguities where two words have the same letters but differ on marks, but they are infrequent (I don't know of any).

          I guess I brought this up because this is where you have several situations where case folding and normalization interact, eg. applying FC_NFKC set when case folding so that later NFK[CD] normalization will be closed, I know this is supposed to solve various ways the YPOGEGRAMMENI can be implemented but I forget the details...

          This is why I think, the general purpose contribution should be case folding, normalization, and the stuff like this (FC_NFKC set) to make sure they work together...

          If you later want to apply something more specialized like StringPrep, you need this logic anyway, see http://www.ietf.org/rfc/rfc3454.txt (especially section 3.2)

          Robert Muir added a comment -

          OK! I think we have a good solution here!

          We can use ICU's Normalizer2 to implement this, by simply creating a custom normalization mapping.
          This way we can meet multiple use-cases, e.g. someone wants to remove diacritics, someone else doesn't.

          And we get solid unicode behavior and high performance to boot.

          So I will keep this issue open, I think the best solution is to take the accent-folding mappings here (or use the ones in AsciiFoldingFilter?) and create a .txt file of mappings, passing it to gennorm2 along with NFKC case fold mappings.

          This way we can implement this on top of LUCENE-2399, all compiled to an efficient binary form with no code.
          I'll take a shot at this once LUCENE-2399 is resolved.

          Robert Muir added a comment -

          Attached is a patch that implements UTR#30 as a tailored unicode normalization form.

          Essentially it acts as a combined "Internationalized AsciiFoldingFilter" + NFKC_CaseFold (Unicode Case Folding, Default Ignorable removal, and NFKC normalization).

          This is a nice alternative to just using ICUNormalizer2Filter in the case that you want "fuzzy matching" (e.g. ignore diacritical marks).

          The patch is large because it contains all the source data files necessary for gennorm2 to regenerate the 41KB binary trie file... the java implementation is trivial.

          Robert Muir added a comment -

          attached is the binary file that goes in the resources/ directory.

          Although I provide the ant logic to regenerate this, it's kind of a pain because:

          • you must download/compile ICU4c (version 4.4), there is no java gennorm2
          • you must run this on a big-endian machine.
          Robert Muir added a comment -

          By the way, I have been running this with the ASCIIFoldingFilter tests and ensuring it's a superset (e.g. we have at least all their mappings).

          But there are some bugs in ASCIIFoldingFilter that should be fixed:

          For example, U+1E9B (LATIN SMALL LETTER LONG S WITH DOT ABOVE).
          In Unicode, this is canonically equivalent to U+017F (LONG S) U+0307 (COMBINING DOT ABOVE).
          AsciiFoldingFilter folds U+1E9B (LONG S WITH DOT) to an F,
          but it folds U+017F (LONG S) to an S.

          Unicode defines this character as a compatibility equivalent to S anyway, but it's worse that ASCIIFoldingFilter is canonically inconsistent with itself.
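          The canonical and compatibility equivalences claimed here can be checked directly with the JDK's `java.text.Normalizer` (a small verification sketch, not part of the patch):

```java
import java.text.Normalizer;

public class LongSDemo {
    public static void main(String[] args) {
        String longSWithDot = "\u1E9B"; // LATIN SMALL LETTER LONG S WITH DOT ABOVE

        // Canonical decomposition: U+017F (LONG S) + U+0307 (COMBINING DOT ABOVE)
        String nfd = Normalizer.normalize(longSWithDot, Normalizer.Form.NFD);
        System.out.println(nfd.equals("\u017F\u0307")); // true

        // Compatibility decomposition further maps U+017F to plain 's',
        // so a canonically consistent folding of U+1E9B should yield 's', not 'f'.
        String nfkd = Normalizer.normalize(longSWithDot, Normalizer.Form.NFKD);
        System.out.println(nfkd.equals("s\u0307")); // true
    }
}
```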

          Robert Muir added a comment -

          Attached is a modified patch (I will upload the new datafile too).

          • applied ICU or Unicode copyright headers to any datafiles where I sourced from their data, and added a mention to NOTICE.txt to that effect.
          • added some additional punctuation mappings to ensure it contains all ASCIIFoldingFilter foldings

          As noted previously, there are 5 places where this disagrees with ASCIIFoldingFilter:
          U+1E9B: LATIN SMALL LETTER LONG S WITH DOT ABOVE (should be s)
          U+2033: DOUBLE PRIME (should be two single quotes)
          U+2036: REVERSED DOUBLE PRIME (same as above)
          U+2038: CARET (folds to CIRCUMFLEX ACCENT, which should be deleted as it's [:Diacritic:])
          U+FF3E: FULLWIDTH CIRCUMFLEX ACCENT (same as above)

          I plan to commit in a few days if no one objects.

          Robert Muir added a comment -

          Updated datafile.

          Robert Muir added a comment -

          Committed revision 936657.

          Jamie added a comment -

          Very useful for Unicode normalization/folding. But after trying this package in the nightly build, I looked back at the patch and realized that it has a dependency on IBM ICU:

          import com.ibm.icu.text.Normalizer2;

          Is this intentional? Will it remain dependent?

          Uwe Schindler added a comment -

          Yes, as this contrib package is called "ICU". If you don't want to use ICU, don't use this contrib. You can always use ASCIIFoldingFilter; it will not get removed.

          Robert Muir added a comment -

          Backported to 3x, revision 941694.

          Grant Ingersoll added a comment -

          Bulk close for 3.1


            People

            • Assignee: Robert Muir
            • Reporter: Robert Haschart
            • Votes: 1
            • Watchers: 5