Lucene - Core
  1. Lucene - Core
  2. LUCENE-2016

replace invalid U+FFFF character during indexing

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.4.1, 2.9
    • Fix Version/s: 2.9.1, 3.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If the invalid U+FFFF character is embedded in a token, it actually causes indexing to silently corrupt the index by writing duplicate terms into the terms dict. CheckIndex will catch the error, and merging will hit exceptions (I think).

      We already replace invalid surrogate pairs with the replacement character U+FFFD, so I'll just do the same with U+FFFF.

      1. LUCENE-2016.patch
        2 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Bulk close all 2.9.1 issues.

        Show
        Michael McCandless added a comment - Bulk close all 2.9.1 issues.
        Hide
        Robert Muir added a comment -

        Earwin, take a look at LUCENE-2019. I added a hyperlink to the list there...

        Show
        Robert Muir added a comment - Earwin, take a look at LUCENE-2019 . I added a hyperlink to the list there...
        Hide
        Earwin Burrfoot added a comment -

        Being one of those hit by U+FFFF earlier, I'd rather like to see remapping happen in some filter and IW throwing an exception on what it deems 'illegal'. Or at very least a big fat documentation entry, that jumps in your face somehow and lists all codepoints that will be remapped.

        Show
        Earwin Burrfoot added a comment - Being one of those hit by U+FFFF earlier, I'd rather like to see remapping happen in some filter and IW throwing an exception on what it deems 'illegal'. Or at very least a big fat documentation entry, that jumps in your face somehow and lists all codepoints that will be remapped.
        Hide
        Robert Muir added a comment -

        Michael, duh I think smart chinese has damaged my brain for the rest of the day.

        Thanks for fixing this.

        Show
        Robert Muir added a comment - Michael, duh I think smart chinese has damaged my brain for the rest of the day. Thanks for fixing this.
        Hide
        Michael McCandless added a comment -

        is there a possibility with your patch still of index problems if you had foobar<U+FFFF> but also foobar<U+FFFF><U+FFFF> ??? will it create duplicate terms of foobar<U+FFFD> ?

        I think this won't cause problems – that term will just be rewritten to foobar\ufffd\ufffd.

        Show
        Michael McCandless added a comment - is there a possibility with your patch still of index problems if you had foobar<U+FFFF> but also foobar<U+FFFF><U+FFFF> ??? will it create duplicate terms of foobar<U+FFFD> ? I think this won't cause problems – that term will just be rewritten to foobar\ufffd\ufffd.
        Hide
        Robert Muir added a comment -

        Michael, one last question. is there a possibility with your patch still of index problems if you had foobar<U+FFFF> but also foobar<U+FFFF><U+FFFF> ??? will it create duplicate terms of foobar<U+FFFD> ?

        Show
        Robert Muir added a comment - Michael, one last question. is there a possibility with your patch still of index problems if you had foobar<U+FFFF> but also foobar<U+FFFF><U+FFFF> ??? will it create duplicate terms of foobar<U+FFFD> ?
        Hide
        Robert Muir added a comment - - edited

        Tricky semantics It rather depends on if you consider Lucene part if your "process-internally" . Depending on the use case, it could be either.

        Not really, Lucene-java uses U+FFFF process-internally, but wasn't mapping it to something valid in the index. So when U+FFFF was stored in the index (or rather, wasn't being stored but handled incorrectly), it created an issue. This is a perfect example of this.

        Show
        Robert Muir added a comment - - edited Tricky semantics It rather depends on if you consider Lucene part if your "process-internally" . Depending on the use case, it could be either. Not really, Lucene-java uses U+FFFF process-internally, but wasn't mapping it to something valid in the index. So when U+FFFF was stored in the index (or rather, wasn't being stored but handled incorrectly), it created an issue. This is a perfect example of this.
        Hide
        Yonik Seeley added a comment -

        This is not true. if you map them to replacement characters, then my app is free to use them "process-internally"

        Tricky semantics It rather depends on if you consider Lucene part if your "process-internally" . Depending on the use case, it could be either.

        Show
        Yonik Seeley added a comment - This is not true. if you map them to replacement characters, then my app is free to use them "process-internally" Tricky semantics It rather depends on if you consider Lucene part if your "process-internally" . Depending on the use case, it could be either.
        Hide
        Robert Muir added a comment -

        But if we forcefully map all invalid-for-interchange unicode characters to the replacement character (I think that's what's being proposed, right?), then your app no longer has any characters it can use for its own "internal" purposes?

        This is not true. if you map them to replacement characters, then my app is free to use them "process-internally" as specified by the standard, without any concern that they will appear in the "interchange" (lucene index data).

        I agree with you, lets open a separate "anal unicode issue". Lets go with your U+FFFF fix for Lucene 2.9, since it fixes lucene java, but correct this for 3.x in the future?

        Show
        Robert Muir added a comment - But if we forcefully map all invalid-for-interchange unicode characters to the replacement character (I think that's what's being proposed, right?), then your app no longer has any characters it can use for its own "internal" purposes? This is not true. if you map them to replacement characters, then my app is free to use them "process-internally" as specified by the standard, without any concern that they will appear in the "interchange" (lucene index data). I agree with you, lets open a separate "anal unicode issue". Lets go with your U+FFFF fix for Lucene 2.9, since it fixes lucene java, but correct this for 3.x in the future?
        Hide
        Michael McCandless added a comment -

        Finally, I completely disagree with the nontrivial performance comment. The trick is to make sure the execution branch / checks for the process-internal characters outside the bmp, only occurs for surrogate pairs. They are statistically very rare and if done right, it will not affect performance of BMP content.

        OK I agree, you're right: we could in fact do this with negligible impact to performance.

        Its my understanding Lucene indexes should be portable to different programming languages: perhaps my implementation in C/perl/python decides to use a different process-internal character, this is allowed by Unicode and I think we should adhere to it, I don't think its being anal.

        But if we forcefully map all invalid-for-interchange unicode characters to the replacement character (I think that's what's being proposed, right?), then your app no longer has any characters it can use for its own "internal" purposes?

        Can you open a new issue to track this? This is a wider discussion than preventing index corruption

        Show
        Michael McCandless added a comment - Finally, I completely disagree with the nontrivial performance comment. The trick is to make sure the execution branch / checks for the process-internal characters outside the bmp, only occurs for surrogate pairs. They are statistically very rare and if done right, it will not affect performance of BMP content. OK I agree, you're right: we could in fact do this with negligible impact to performance. Its my understanding Lucene indexes should be portable to different programming languages: perhaps my implementation in C/perl/python decides to use a different process-internal character, this is allowed by Unicode and I think we should adhere to it, I don't think its being anal. But if we forcefully map all invalid-for-interchange unicode characters to the replacement character (I think that's what's being proposed, right?), then your app no longer has any characters it can use for its own "internal" purposes? Can you open a new issue to track this? This is a wider discussion than preventing index corruption
        Hide
        Robert Muir added a comment -

        Mike, I disagree.

        Here is my reasoning: Lucene Java happens to use U+FFFF as an internal identifier for processing.
        However, this is your choice, you could have just as easily used U+FFFE, or some other codepoint even outside the BMP for this purpose. The standard gives you several options, perhaps you might need multiple process-internal characters to accomplish what you want to do internally.

        Its my understanding Lucene indexes should be portable to different programming languages: perhaps my implementation in C/perl/python decides to use a different process-internal character, this is allowed by Unicode and I think we should adhere to it, I don't think its being anal.

        Finally, I completely disagree with the nontrivial performance comment. The trick is to make sure the execution branch / checks for the process-internal characters outside the bmp, only occurs for surrogate pairs. They are statistically very rare and if done right, it will not affect performance of BMP content.

        Show
        Robert Muir added a comment - Mike, I disagree. Here is my reasoning: Lucene Java happens to use U+FFFF as an internal identifier for processing. However, this is your choice, you could have just as easily used U+FFFE, or some other codepoint even outside the BMP for this purpose. The standard gives you several options, perhaps you might need multiple process-internal characters to accomplish what you want to do internally. Its my understanding Lucene indexes should be portable to different programming languages: perhaps my implementation in C/perl/python decides to use a different process-internal character, this is allowed by Unicode and I think we should adhere to it, I don't think its being anal. Finally, I completely disagree with the nontrivial performance comment. The trick is to make sure the execution branch / checks for the process-internal characters outside the bmp, only occurs for surrogate pairs. They are statistically very rare and if done right, it will not affect performance of BMP content.
        Hide
        Michael McCandless added a comment -

        Lucene has "traditionally" not enforced the "not for interchange"
        characters, ie, just let them through.

        But then with the indexing speedups (LUCENE-843), we no longer allowed
        U+FFFF, and with the cutover to true UTF-8 in the index, we no longer
        allowed invalid surrogate pairs.

        And we know apps use these characters (because they hit problems with
        U+FFFF on upgrading to 2.3).

        So I think it would be too anal to suddenly replace all of these
        invalid interchange chars, starting today? (Though, it would
        obviously be more "standards compliant"). Plus, it would cost us non
        trivial indexing CPU to do so!!

        Show
        Michael McCandless added a comment - Lucene has "traditionally" not enforced the "not for interchange" characters, ie, just let them through. But then with the indexing speedups ( LUCENE-843 ), we no longer allowed U+FFFF, and with the cutover to true UTF-8 in the index, we no longer allowed invalid surrogate pairs. And we know apps use these characters (because they hit problems with U+FFFF on upgrading to 2.3). So I think it would be too anal to suddenly replace all of these invalid interchange chars, starting today? (Though, it would obviously be more "standards compliant"). Plus, it would cost us non trivial indexing CPU to do so!!
        Hide
        Robert Muir added a comment - - edited

        even disregarding the problem, I think FFFD is much better than truncation... its what i expect.

        but I think we should handle U+FFFE also. (and FDD0-FDEF)

        there are actually a few more 'guaranteed not to be a characters, not for interchange' outside of the BMP, but that invalid surrogate logic looks pretty hairy already
        to see the full list, look at http://www.unicode.org/charts/

        under subheading:
        Noncharacters in Charts

        Reserved range

        Noncharacters at end of ...

        Show
        Robert Muir added a comment - - edited even disregarding the problem, I think FFFD is much better than truncation... its what i expect. but I think we should handle U+FFFE also. (and FDD0-FDEF) there are actually a few more 'guaranteed not to be a characters, not for interchange' outside of the BMP, but that invalid surrogate logic looks pretty hairy already to see the full list, look at http://www.unicode.org/charts/ under subheading: Noncharacters in Charts Reserved range Noncharacters at end of ...
        Hide
        Michael McCandless added a comment -

        Fix is trivial.

        Show
        Michael McCandless added a comment - Fix is trivial.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development