Lucene - Core
  1. Lucene - Core
  2. LUCENE-1390

add ASCIIFoldingFilter and deprecate ISOLatin1AccentFilter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      any

    • Lucene Fields:
      New, Patch Available

      Description

      The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
      It does what it does and there is no bug with it.

      It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
      See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
      See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block

      That way, all languages using roman characters are covered.
      A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.

      1. ASCIIFoldingFilter.patch
        204 kB
        Andi Vajda
      2. ASCIIFoldingFilter.patch
        207 kB
        Steve Rowe
      3. ASCIIFoldingFilter.patch
        207 kB
        Steve Rowe

        Issue Links

          Activity

          Hide
          Andi Vajda added a comment -

          The new ISOLatinAccentFilter class, superceding ISOLatin1AccentFilter.

          Show
          Andi Vajda added a comment - The new ISOLatinAccentFilter class, superceding ISOLatin1AccentFilter.
          Hide
          Steve Rowe added a comment -

          I think that at a minimum, the Latin Extended Additional block [U+1E00-U+1EFF] should be added, since it is included in Unicode v3.0 (the version of Unicode that Java 1.4.2 is conformant with), and it consists mainly of characters which have exact diacritic-stripped ASCII versions. Wikipedia doesn't seem to have a page for this block, and I can't find a code chart PDF for Unicode v3.0 on unicode.org, so here's the PDF for Unicode v5.1 (the latest version):

          http://www.unicode.org/charts/PDF/U1E00.pdf

          Probably should check against the Unicode 3.0 data file to make sure there haven't been any changes to this block between v3.0 and v5.1.0 - i.e., to make sure that the above-linked PDF is accurate for Unicode v3.0.

          There is one more Latin block listed in the Unicode 3.0 blocks data file: Latin Extended-B. Since many of the characters in this block don't have exact diacritic-stripped ASCII versions, maybe it could be argued that they shouldn't be included in this filter. A fair proportion of them (maybe 40%), however, do.

          Show
          Steve Rowe added a comment - I think that at a minimum, the Latin Extended Additional block [U+1E00-U+1EFF] should be added, since it is included in Unicode v3.0 (the version of Unicode that Java 1.4.2 is conformant with), and it consists mainly of characters which have exact diacritic-stripped ASCII versions. Wikipedia doesn't seem to have a page for this block, and I can't find a code chart PDF for Unicode v3.0 on unicode.org, so here's the PDF for Unicode v5.1 (the latest version): http://www.unicode.org/charts/PDF/U1E00.pdf Probably should check against the Unicode 3.0 data file to make sure there haven't been any changes to this block between v3.0 and v5.1.0 - i.e., to make sure that the above-linked PDF is accurate for Unicode v3.0. There is one more Latin block listed in the Unicode 3.0 blocks data file : Latin Extended-B . Since many of the characters in this block don't have exact diacritic-stripped ASCII versions, maybe it could be argued that they shouldn't be included in this filter. A fair proportion of them (maybe 40%), however, do.
          Hide
          Andi Vajda added a comment -

          Makes sense.

          I did look at that block and it looked much more remote from the purpose of
          this class. But you're right, many of these could be handled as well.

          And I agree that they should be handled to be able to claim to be doing a
          complete job.

          So far, I've claimed that this class handles Latin 1 and Latin Extended A
          which should cover most, if not all, european/turkish languages using latin
          script and thus goes much farther than the ISOLatin1AccentFilter in that
          respect.

          Show
          Andi Vajda added a comment - Makes sense. I did look at that block and it looked much more remote from the purpose of this class. But you're right, many of these could be handled as well. And I agree that they should be handled to be able to claim to be doing a complete job. So far, I've claimed that this class handles Latin 1 and Latin Extended A which should cover most, if not all, european/turkish languages using latin script and thus goes much farther than the ISOLatin1AccentFilter in that respect.
          Hide
          Andi Vajda added a comment -

          ISOLatinAccentFilter.java again, now with Unicode Latin Extended B as well.

          Show
          Andi Vajda added a comment - ISOLatinAccentFilter.java again, now with Unicode Latin Extended B as well.
          Hide
          Steve Rowe added a comment -

          Andi,

          What do you think of the idea of remaking this into a filter that folds all Unicode characters to their ASCII equivalents? This would be (I think) a superset of what you've done, but would also include things like fullwidth Latin characters, smaller variants of some ASCII symbols, fancy quotation marks, etc. Maybe non-Arabic decimal numeric characters could be converted to their Arabic versions.

          Show
          Steve Rowe added a comment - Andi, What do you think of the idea of remaking this into a filter that folds all Unicode characters to their ASCII equivalents? This would be (I think) a superset of what you've done, but would also include things like fullwidth Latin characters, smaller variants of some ASCII symbols, fancy quotation marks, etc. Maybe non-Arabic decimal numeric characters could be converted to their Arabic versions.
          Hide
          Andi Vajda added a comment -

          I think that would be a whole lot of typing
          Not a bad idea, still.
          I'm in the process of entering the 1E00 - 1EFF range.
          The Extended-C and D blocks also have relevant things to include but I'm
          hoping to stop at the Extended Additional block currently in progress.

          Show
          Andi Vajda added a comment - I think that would be a whole lot of typing Not a bad idea, still. I'm in the process of entering the 1E00 - 1EFF range. The Extended-C and D blocks also have relevant things to include but I'm hoping to stop at the Extended Additional block currently in progress.
          Hide
          Andi Vajda added a comment -

          Now with the u1e00 - u1eff range.

          Show
          Andi Vajda added a comment - Now with the u1e00 - u1eff range.
          Hide
          Steve Rowe added a comment -

          The Extended-C and D blocks also have relevant things to include

          These two blocks were not included in Unicode 3.0, the version supported by Java 1.4.2, which is the Java version that Lucene 2.X supports.

          Nevertheless, the ranges these two blocks occupy in Unicode 5.1 are non-characters in Unicode 3.0, so I don't think it would be a problem to add them.

          I'll take a look at adding more stuff this weekend.

          I also will add the Unicode character descriptions to the comments for each character (e.g. "LATIN CAPITAL LETTER A WITH MACRON").

          Show
          Steve Rowe added a comment - The Extended-C and D blocks also have relevant things to include These two blocks were not included in Unicode 3.0, the version supported by Java 1.4.2, which is the Java version that Lucene 2.X supports. Nevertheless, the ranges these two blocks occupy in Unicode 5.1 are non-characters in Unicode 3.0, so I don't think it would be a problem to add them. I'll take a look at adding more stuff this weekend. I also will add the Unicode character descriptions to the comments for each character (e.g. "LATIN CAPITAL LETTER A WITH MACRON").
          Hide
          Hoss Man added a comment -

          It would be good to clarify how the goals of this issue differ from LUCENE-1343 ... as an outside observer they seem to be the same thing

          Show
          Hoss Man added a comment - It would be good to clarify how the goals of this issue differ from LUCENE-1343 ... as an outside observer they seem to be the same thing
          Hide
          Steve Rowe added a comment -

          It would be good to clarify how the goals of this issue differ from LUCENE-1343 ... as an outside observer they seem to be the same thing

          LUCENE-1343 depends on either Java 6 or an ICU jar. Java 5 support (much less Java 6 support) is still a ways off, and core Lucene can't have external dependencies. IMO, this issue intends to provide core Lucene support for folding a wider set of accented characters to ASCII.

          In a moment I'll attach a patch that, in addition to the larger set of accented characters to be folded that Andi has provided, folds all Unicode characters I could find to their ASCII equivalents.

          Show
          Steve Rowe added a comment - It would be good to clarify how the goals of this issue differ from LUCENE-1343 ... as an outside observer they seem to be the same thing LUCENE-1343 depends on either Java 6 or an ICU jar. Java 5 support (much less Java 6 support) is still a ways off, and core Lucene can't have external dependencies. IMO, this issue intends to provide core Lucene support for folding a wider set of accented characters to ASCII. In a moment I'll attach a patch that, in addition to the larger set of accented characters to be folded that Andi has provided, folds all Unicode characters I could find to their ASCII equivalents.
          Hide
          Steve Rowe added a comment -

          Changes from Andi's version:

          1. Changed the name of the class to ASCIIFoldingFilter
          2. Added the Unicode chracter descriptions to comments on each character
          3. Added a test class
          4. Added several other Unicode blocks from which characters are converted to their ASCII equivalents. Added characters include digits and punctuation.

          I did not provide mappings for characters from the Math block - flattening circled plus, for example, didn't seem appropriate.

          I did provide mappings for IPA and two other phonetic character blocks, and I'm not sure whether this is appropriate. I was following what seemed to me to be the logic of Andi's mappings, and those provided by Latin1AccentFilter: convert characters to those that look like them in ASCII. As a result, e.g., the character described as "LATIN SMALL LETTER TURNED M" (U+0270) from the IPA block is mapped to "m", regardless of its actual phonetic value.

          There are lots of mappings in there now. I generated the mappings by Perl scripting over the contents of the Unicode 5.1 version of UnicodeData.txt from Unicode.org, after grep'ing e.g. for "LATIN" and "LETTER" or "DIGRAPH", etc., and then moved things around to the appropriate places by hand. I guess this is one weakness of this patch: it's large enough that manual verification is tough. It's my hope that adding the Unicode character descriptions will allow for at least improved verifiability.

          Show
          Steve Rowe added a comment - Changes from Andi's version: Changed the name of the class to ASCIIFoldingFilter Added the Unicode chracter descriptions to comments on each character Added a test class Added several other Unicode blocks from which characters are converted to their ASCII equivalents. Added characters include digits and punctuation. I did not provide mappings for characters from the Math block - flattening circled plus, for example, didn't seem appropriate. I did provide mappings for IPA and two other phonetic character blocks, and I'm not sure whether this is appropriate. I was following what seemed to me to be the logic of Andi's mappings, and those provided by Latin1AccentFilter: convert characters to those that look like them in ASCII. As a result, e.g., the character described as "LATIN SMALL LETTER TURNED M" (U+0270) from the IPA block is mapped to "m", regardless of its actual phonetic value. There are lots of mappings in there now. I generated the mappings by Perl scripting over the contents of the Unicode 5.1 version of UnicodeData.txt from Unicode.org, after grep'ing e.g. for "LATIN" and "LETTER" or "DIGRAPH", etc., and then moved things around to the appropriate places by hand. I guess this is one weakness of this patch: it's large enough that manual verification is tough. It's my hope that adding the Unicode character descriptions will allow for at least improved verifiability.
          Hide
          Steve Rowe added a comment - - edited

          Forgot to mention three additional change in ASCIIFoldingFilter.patch:

          1. Some of the added character mappings produce 3 or 4 characters - e.g. the character "⑾ " – described as "PARENTHESIZED NUMBER ELEVEN" – is mapped to '(' + '1' + '1' +')'.
          2. As a result of the increased maximum length of each mapping, the output buffer length is set to 4 times the length of the input token.
          3. ArrayUtils.getNextSize() is used to resize the output buffer when it needs to grow.
          Show
          Steve Rowe added a comment - - edited Forgot to mention three additional change in ASCIIFoldingFilter.patch: Some of the added character mappings produce 3 or 4 characters - e.g. the character "⑾ " – described as "PARENTHESIZED NUMBER ELEVEN" – is mapped to '(' + '1' + '1' +')'. As a result of the increased maximum length of each mapping, the output buffer length is set to 4 times the length of the input token. ArrayUtils.getNextSize() is used to resize the output buffer when it needs to grow.
          Hide
          Steve Rowe added a comment -

          Minor adjustment to previous version: this version fixes a couple of character mappings that were out of order in the mapped-to-capital-"B" section.

          Show
          Steve Rowe added a comment - Minor adjustment to previous version: this version fixes a couple of character mappings that were out of order in the mapped-to-capital-"B" section.
          Hide
          Andi Vajda added a comment -

          Wow, Steve, I'm impressed. This is quite an improvement over my earlier patches and even more of an improvement over ISOLatin1AccentFilter. Thank you for doing this !
          What's next ? Does any Lucene committer watching this bug have objections in checking this in ?
          One (minor) missing piece to the patch is the deprecation of ISOLatin1AccentFilter itself.

          Show
          Andi Vajda added a comment - Wow, Steve, I'm impressed. This is quite an improvement over my earlier patches and even more of an improvement over ISOLatin1AccentFilter. Thank you for doing this ! What's next ? Does any Lucene committer watching this bug have objections in checking this in ? One (minor) missing piece to the patch is the deprecation of ISOLatin1AccentFilter itself.
          Hide
          Steve Rowe added a comment -

          Some of the new mappings, e.g. the character representing parenthesized 10 mapped to the 4 character sequence "(10)", might better be handled before tokenization, very much like SOLR-822 - I thought about doing something like that but decided that it was too much work . However, maybe it would be possible to write a reader adapter that takes in a token filter (like this one) and then passes the entire contents of the reader through the filter.

          Show
          Steve Rowe added a comment - Some of the new mappings, e.g. the character representing parenthesized 10 mapped to the 4 character sequence "(10)", might better be handled before tokenization, very much like SOLR-822 - I thought about doing something like that but decided that it was too much work . However, maybe it would be possible to write a reader adapter that takes in a token filter (like this one) and then passes the entire contents of the reader through the filter.
          Hide
          Robert Muir added a comment -

          I am using this patch and its working well.

          Nitpick... wonder if you could change the mapping of Ə and ə to from E to A... This character is only used in Azeri and not too long ago (<20 years) it was written as A with umlaut, so there is some precedence.

          Thanks,
          Robert

          Show
          Robert Muir added a comment - I am using this patch and its working well. Nitpick... wonder if you could change the mapping of Ə and ə to from E to A... This character is only used in Azeri and not too long ago (<20 years) it was written as A with umlaut, so there is some precedence. Thanks, Robert
          Hide
          Andi Vajda added a comment -

          Could you please attach a patch for the change you requested, I'm not sure
          it's displaying correctly here. You seem to asking about a change for the
          mapping of AE and E+acute which is unexpected. Thanks !

          Show
          Andi Vajda added a comment - Could you please attach a patch for the change you requested, I'm not sure it's displaying correctly here. You seem to asking about a change for the mapping of AE and E+acute which is unexpected. Thanks !
          Hide
          Sean Timm added a comment -

          From my brief reading, it seems that "ae" would be the best transliteration for the schwa character. "Some people write 'æ' instead if the schwa is not available. "

          http://www.omniglot.com/writing/azeri.htm

          Show
          Sean Timm added a comment - From my brief reading, it seems that "ae" would be the best transliteration for the schwa character. "Some people write 'æ' instead if the schwa is not available. " http://www.omniglot.com/writing/azeri.htm
          Hide
          Robert Muir added a comment -

          sean... from your link: On 16th May 1992 the Latin alphabet for Azerbaijani was slightly revised - the letter ä was replaced with ə and the order of letters was changed as well.

          i've never seen 'ae' used in its place, certainly not in the Azeri text that I am indexing.

          andi... im referring to the schwa character in azeri: U+018F (uppercase) and U+0259 (lowercase)

          Show
          Robert Muir added a comment - sean... from your link: On 16th May 1992 the Latin alphabet for Azerbaijani was slightly revised - the letter ä was replaced with ə and the order of letters was changed as well. i've never seen 'ae' used in its place, certainly not in the Azeri text that I am indexing. andi... im referring to the schwa character in azeri: U+018F (uppercase) and U+0259 (lowercase)
          Hide
          Robert Muir added a comment -

          with regards to transliteration the bgn/pcgn standard states:

          The special letter Ə, ə known as schwa, should be reproduced in that form whenever encountered.
          Use Ә (U+04D8) and ә (U+04D9) for schwa when writing in the Cyrillic script, but use Ə (U+018F) and ə (U+0259) for schwa when writing in the Roman alphabet.

          In those instances when it cannot be reproduced, however, the letter Ä ä may be substituted for it.

          http://earth-info.nga.mil/gns/html/Romanization/Romanization_Azerbaijani.pdf

          Show
          Robert Muir added a comment - with regards to transliteration the bgn/pcgn standard states: The special letter Ə, ə known as schwa, should be reproduced in that form whenever encountered. Use Ә (U+04D8) and ә (U+04D9) for schwa when writing in the Cyrillic script, but use Ə (U+018F) and ə (U+0259) for schwa when writing in the Roman alphabet. In those instances when it cannot be reproduced, however, the letter Ä ä may be substituted for it. http://earth-info.nga.mil/gns/html/Romanization/Romanization_Azerbaijani.pdf
          Hide
          Andi Vajda added a comment -

          Ah, I see now what you're asking for. Sorry about the misunderstanding.
          I believe I had picked 'e' for schwa because it looks closest to that
          letter. I have no objections to switching to using 'a' instead if that's
          more "correct".
          This Wikipedia seems to agree: http://en.wikipedia.org/wiki/Schwa_(Cyrillic)
          This other Wikipedia http://en.wikipedia.org/wiki/Schwa is less clear about
          this, but it seems that using 'a' instead of 'e' doesn't contradict it.

          Steven, I can amend the patch but you said you had more changes coming. If
          that's the case, could you please add this change as well. If that's not the
          case, is it ok for me to add this change and call for this bug to be
          committed to trunk and closed ?

          Show
          Andi Vajda added a comment - Ah, I see now what you're asking for. Sorry about the misunderstanding. I believe I had picked 'e' for schwa because it looks closest to that letter. I have no objections to switching to using 'a' instead if that's more "correct". This Wikipedia seems to agree: http://en.wikipedia.org/wiki/Schwa_(Cyrillic ) This other Wikipedia http://en.wikipedia.org/wiki/Schwa is less clear about this, but it seems that using 'a' instead of 'e' doesn't contradict it. Steven, I can amend the patch but you said you had more changes coming. If that's the case, could you please add this change as well. If that's not the case, is it ok for me to add this change and call for this bug to be committed to trunk and closed ?
          Hide
          Steve Rowe added a comment -

          Steven, I can amend the patch but you said you had more changes coming. If that's the case, could you please add this change as well. If that's not the case, is it ok for me to add this change and call for this bug to be committed to trunk and closed ?

          Andi, I don't think I said I had more changes coming. If you are referring to this comment about pre-tokenization filtering, I meant that as an idea that could be pursued separately from this issue.

          At any rate, please feel free to add Robert's suggested change yourself.

          Show
          Steve Rowe added a comment - Steven, I can amend the patch but you said you had more changes coming. If that's the case, could you please add this change as well. If that's not the case, is it ok for me to add this change and call for this bug to be committed to trunk and closed ? Andi, I don't think I said I had more changes coming. If you are referring to this comment about pre-tokenization filtering , I meant that as an idea that could be pursued separately from this issue. At any rate, please feel free to add Robert's suggested change yourself.
          Hide
          Andi Vajda added a comment -

          Great, I'll include Robert's change and try to convince a committer to
          finalize it.

          Show
          Andi Vajda added a comment - Great, I'll include Robert's change and try to convince a committer to finalize it.
          Hide
          Robert Muir added a comment -

          thanks guys, just as a comment to whoever is listining I think this is very useful functionality.

          I am indexing a lot of docs and doing it with ICU works well, but that method (unicode decomposition etc) is very expensive and still doesnt handle many common cases. In profiling, it was slowing down entire indexing process.

          The existing ISO filter doesn't handle many cases that are actually in use in my text, but this filter works well and appears to have coverage for most of the common cases such as full width forms, at the same time it is fast.

          Thanks,
          Robert

          Show
          Robert Muir added a comment - thanks guys, just as a comment to whoever is listining I think this is very useful functionality. I am indexing a lot of docs and doing it with ICU works well, but that method (unicode decomposition etc) is very expensive and still doesnt handle many common cases. In profiling, it was slowing down entire indexing process. The existing ISO filter doesn't handle many cases that are actually in use in my text, but this filter works well and appears to have coverage for most of the common cases such as full width forms, at the same time it is fast. Thanks, Robert
          Hide
          Mark Miller added a comment -

          Hey guys, not sure how soon I can bring some time to bear on this, but I'd be happy to help get this committed.

          Show
          Mark Miller added a comment - Hey guys, not sure how soon I can bring some time to bear on this, but I'd be happy to help get this committed.
          Hide
          Mark Miller added a comment - - edited

          In regards to deprecating ISOLatin1AccentFilter: what are the back compatibility issues? What is the likelyhood that a forced upgrade to this class would lose words in an older index without a reindex?

          If its a real concern, we could keep the deprecated class until 3.0, but that still wouldn't help anyone that wanted to move to 2 from 3 without a reindex (if thats something we will maintain on 3).

          So I'm just exploring for comments, but maybe we leave both classes?

          Show
          Mark Miller added a comment - - edited In regards to deprecating ISOLatin1AccentFilter: what are the back compatibility issues? What is the likelyhood that a forced upgrade to this class would lose words in an older index without a reindex? If its a real concern, we could keep the deprecated class until 3.0, but that still wouldn't help anyone that wanted to move to 2 from 3 without a reindex (if thats something we will maintain on 3). So I'm just exploring for comments, but maybe we leave both classes?
          Hide
          Andi Vajda added a comment -

          This class includes all of ISOLatin1AccentFilter.

          Still, a difference in behaviour could be seen when using the new
          filter with characters getting converted now that didn't before.

          If that sort of lack of backwards compatibility is something we don't want
          to impose on the 3.0 release then the ISOLatin1AccentFilter class needs to
          be preserved.

          Thanks for volunteering to finalize this bug !

          Show
          Andi Vajda added a comment - This class includes all of ISOLatin1AccentFilter. Still, a difference in behaviour could be seen when using the new filter with characters getting converted now that didn't before. If that sort of lack of backwards compatibility is something we don't want to impose on the 3.0 release then the ISOLatin1AccentFilter class needs to be preserved. Thanks for volunteering to finalize this bug !
          Hide
          Steve Rowe added a comment -

          What is the likelyhood that a forced upgrade to this class would lose words in an older index without a reindex?

          The problem would be words that contain characters that were not folded by ISOLatin1AccentFilter, but are folded by ASCIIFoldingFilter, and that are used in documents and in queries. Individual implementors would have to make that determination, but it's not outside the realm of possibility.

          If ISOLatin1AccentFilter were deprecated for 2.9, and advertised as targeted for removal in 3.0, assuming there will be a significant gap in time between the 2.9 and 3.0 releases, that would give users time to complain about its pending demise, and the plan to remove it could be revisited based on that feedback.

          Show
          Steve Rowe added a comment - What is the likelyhood that a forced upgrade to this class would lose words in an older index without a reindex? The problem would be words that contain characters that were not folded by ISOLatin1AccentFilter, but are folded by ASCIIFoldingFilter, and that are used in documents and in queries. Individual implementors would have to make that determination, but it's not outside the realm of possibility. If ISOLatin1AccentFilter were deprecated for 2.9, and advertised as targeted for removal in 3.0, assuming there will be a significant gap in time between the 2.9 and 3.0 releases, that would give users time to complain about its pending demise, and the plan to remove it could be revisited based on that feedback.
          Hide
          Mark Miller added a comment -

          Everything looks pretty good to me. If you can work up one last patch, I'll put it through its paces. I'd like to hear another committers opinion on deprecating ISOLatin1AccentFilter as well, but I guess we will see if we are able to draw the attention.

          I think we have a lot of latitude with the 3.0 move, but I don't know what the consensus is on the limits of that latitude...

          • Mark
          Show
          Mark Miller added a comment - Everything looks pretty good to me. If you can work up one last patch, I'll put it through its paces. I'd like to hear another committers opinion on deprecating ISOLatin1AccentFilter as well, but I guess we will see if we are able to draw the attention. I think we have a lot of latitude with the 3.0 move, but I don't know what the consensus is on the limits of that latitude... Mark
          Hide
          Robert Muir added a comment -

          does ISOLatin1AccentFilter really need to be deprecated? I don't think its misleading, could just reiterate it only covers Latin 1 and ref this one in the docs?

          just as this one documents what blocks it covers and i don't expect it to normalize U+338E to 'mg'

          Show
          Robert Muir added a comment - does ISOLatin1AccentFilter really need to be deprecated? I don't think its misleading, could just reiterate it only covers Latin 1 and ref this one in the docs? just as this one documents what blocks it covers and i don't expect it to normalize U+338E to 'mg'
          Hide
          Mark Miller added a comment -

          does ISOLatin1AccentFilter really need to be deprecated? I don't think its misleading, could just reiterate it only covers Latin 1 and ref this one in the docs?

          Thats the leaning of my thinking at the moment as well.

          Show
          Mark Miller added a comment - does ISOLatin1AccentFilter really need to be deprecated? I don't think its misleading, could just reiterate it only covers Latin 1 and ref this one in the docs? Thats the leaning of my thinking at the moment as well.
          Hide
          Andi Vajda added a comment -

          This latest version supercedes the previous one and moves all schwa characters to the 'A' or 'a' depending on their case. 0259, lowercase schwa, was missing and thus added.

          Show
          Andi Vajda added a comment - This latest version supercedes the previous one and moves all schwa characters to the 'A' or 'a' depending on their case. 0259, lowercase schwa, was missing and thus added.
          Hide
          Andi Vajda added a comment -

          Mark, I attached a new version of the patch with Robert's change.

          As for the deprecation of ISOLatin1AccentFilter.java, I don't have a definite opinion on this.
          It's pretty much redundant with what this new class does. If the maintenance overhead is not too bad then keeping the duplication around may be worth the effort to preserve some backwards compat.

          Thanks for taking this from here !
          Andi..

          Show
          Andi Vajda added a comment - Mark, I attached a new version of the patch with Robert's change. As for the deprecation of ISOLatin1AccentFilter.java, I don't have a definite opinion on this. It's pretty much redundant with what this new class does. If the maintenance overhead is not too bad then keeping the duplication around may be worth the effort to preserve some backwards compat. Thanks for taking this from here ! Andi..
          Hide
          Mark Miller added a comment -

          Not to be wishy washy, but deprecating is looking better to me. If there is a large outcry, it can always be revisited before its fully removed. The odds that you will be affected legitimately, appear pretty low. We can call it out in 3.0 changes, and those that really need it (which is looking like it would be a weird situation to need it) could maintain their own copy. Seem reasonable?

          Show
          Mark Miller added a comment - Not to be wishy washy, but deprecating is looking better to me. If there is a large outcry, it can always be revisited before its fully removed. The odds that you will be affected legitimately, appear pretty low. We can call it out in 3.0 changes, and those that really need it (which is looking like it would be a weird situation to need it) could maintain their own copy. Seem reasonable?
          Hide
          Andi Vajda added a comment -

          Yep, I'm leaning that way too.

          Show
          Andi Vajda added a comment - Yep, I'm leaning that way too.
          Hide
          Mark Miller added a comment -

          So my final thought on this is performance...is handling more much slower? Could that be a reason to keep the Latin1 filter as well?

          Show
          Mark Miller added a comment - So my final thought on this is performance...is handling more much slower? Could that be a reason to keep the Latin1 filter as well?
          Hide
          Robert Muir added a comment -

          its a bit slower, but the difference is minor. i just ran some tests with some cpu-bound (these filters are right at the top of hprof.txt) indexes that i build

          i ran em a couple times and it looks like this... not very scientific but it gives an idea.

          ASCII Folding filter index time (ms): 143365
          ISOLatin1Accent filter (ms): 134649

          Show
          Robert Muir added a comment - its a bit slower, but the difference is minor. i just ran some tests with some cpu-bound (these filters are right at the top of hprof.txt) indexes that i build i ran em a couple times and it looks like this... not very scientific but it gives an idea. ASCII Folding filter index time (ms): 143365 ISOLatin1Accent filter (ms): 134649
          Hide
          Robert Muir added a comment -

          sorry, that wasn't a fair test case. a good chunk of those docs contain accents outside of latin1, so asciifoldingfilter was doing more work

          i reran on some heavily accented (but only latin1) data and the difference was negligible, 1% or so

          appears asciifoldingfilter only slows you down versus isolatin1accentfilter in the case where it probably should be! (you have accents outside of latin1 but are using latin1accentfilter)

          Show
          Robert Muir added a comment - sorry, that wasn't a fair test case. a good chunk of those docs contain accents outside of latin1, so asciifoldingfilter was doing more work i reran on some heavily accented (but only latin1) data and the difference was negligible, 1% or so appears asciifoldingfilter only slows you down versus isolatin1accentfilter in the case where it probably should be! (you have accents outside of latin1 but are using latin1accentfilter)
          Hide
          Mark Miller added a comment -

          Thanks Robert. I plan to commit this in a few days with the deprecation of the latin1 filter for removal in 3.0.

          Show
          Mark Miller added a comment - Thanks Robert. I plan to commit this in a few days with the deprecation of the latin1 filter for removal in 3.0.
          Hide
          Mark Miller added a comment -

          Committed, thanks a lot guys!

          Show
          Mark Miller added a comment - Committed, thanks a lot guys!
          Hide
          Andi Vajda added a comment -

          Thanks Mark !

          Show
          Andi Vajda added a comment - Thanks Mark !

            People

            • Assignee:
              Mark Miller
              Reporter:
              Andi Vajda
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development