Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I'd like to contribute a CharFilter that Unicode-normalizes input with ICU4J.

      The benefit of implementing this process as a CharFilter is that the tokenizer can work on normalized text, while offset correction ensures that the fast vector highlighter and other offset-dependent features do not break.

      The implementation is available at the following repository:
      https://github.com/ippeiukai/ICUNormalizer2CharFilter

      Unfortunately this is an unpaid side project and I cannot spend much time merging my work into Lucene to produce a proper patch. I'd appreciate it if anyone could give it a go. I'm happy to relicense it under whatever license meets your needs.
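      For context on why offset correction matters here: Unicode normalization can change the length of the text, so the character offsets a tokenizer reports on the normalized stream no longer match the original input. A minimal sketch (using the JDK's java.text.Normalizer rather than ICU4J, purely for illustration) shows the length change:

      ```java
      import java.text.Normalizer;

      public class NormalizationOffsetDemo {
          public static void main(String[] args) {
              // "ﬁ" (U+FB01, LATIN SMALL LIGATURE FI) is a single char in the input...
              String input = "\uFB01le";
              // ...but NFKC normalization expands it to the two chars "fi", so every
              // offset after it shifts by one. A CharFilter with an offset-correction
              // map can translate offsets back to the original text; a plain
              // TokenFilter applied after tokenization cannot.
              String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
              System.out.println(normalized);          // file
              System.out.println(input.length());      // 3
              System.out.println(normalized.length()); // 4
          }
      }
      ```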

      1. 4072.patch
        16 kB
        David Goldfarb
      2. 4072.patch
        16 kB
        David Goldfarb
      3. DebugCode.txt
        5 kB
        Ippei UKAI
      4. ippeiukai-ICUNormalizer2CharFilter-4752cad.zip
        15 kB
        Ippei UKAI
      5. LUCENE-4072.patch
        17 kB
        Robert Muir
      6. LUCENE-4072.patch
        16 kB
        Robert Muir
      7. LUCENE-4072.patch
        14 kB
        David Goldfarb
      8. LUCENE-4072.patch
        14 kB
        David Goldfarb
      9. LUCENE-4072.patch
        14 kB
        Robert Muir
      10. LUCENE-4072.patch
        11 kB
        David Goldfarb
      11. LUCENE-4072.patch
        16 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        This looks cool: it looked tricky to me to implement that iterative normalization API; I'm glad you tackled it.

        It's a good idea; for some tokenizers, like n-grams, you really need normalization as a CharFilter.

        I'll see if I can help with the integration.

        Robert Muir added a comment -

        Hello, if you are willing to contribute this under the Apache license (http://www.apache.org/licenses/LICENSE-2.0.html), would
        you mind uploading a .zip file of your codebase (git has a button to create it), and check the box "Grant license to ASF for inclusion in ASF works"
        when doing the upload?

        Then I can help turn it into a patch. Thanks again!

        Ippei UKAI added a comment -

        Robert,
        The zipped copy of the repository is attached and licensed to ASF for inclusion.
        Thanks!

        Robert Muir added a comment -

        Attached is the filter, turned into a patch.

        However, I added an additional random test and it currently fails... I will look into this more.

        Ippei UKAI added a comment -

        I've backported your additional random-string test case and reproduced the problem.
        However, I haven't managed to pinpoint the problem so far. With no clue coming out of the AssertionError, attacking LUCENE-3900 first might be more productive.

        This final-offset issue looks like a common occurrence among CharFilters (cf. LUCENE-3820, LUCENE-3971), but I'm yet to find a common theme in them.

        Ippei UKAI added a comment - edited

        I've found a problem:

        In ICUNormalizer2CharFilter#recordOffsetDiff(int, int), the positive-diff case should look like the following:
        <pre>
        addOffCorrectMap(charCount + Math.min(1, outputLength), cumuDiff + diff);
        </pre>

        Hopefully, there won't be more.

        I do grant license to ASF to include the above line in ASF works (as per the Apache License §5)

        Ippei UKAI added a comment -

        How I debugged for a reference.

        David Goldfarb added a comment -

        I'm available to help make this work. I updated Ippei UKAI's code to use the 4.0 API (CharStream, CharReader, and ReusableAnalyzerBase were affected). I updated Robert Muir's random-input test and it's not failing. I'm not sure whether Ippei's last fix worked and this issue ought to have been closed then; I don't see this class in the Lucene library.

        Let me know if this helps.

        Robert Muir added a comment -

        Thanks, David! I will look into this; I was unaware the previous problem had been resolved!

        Robert Muir added a comment -

        I looked over the patch, and added license headers and so on.

        I also added some new tests, which currently fail. I think the problem is that the current logic iterates over chars (e.g. passing the result of charAt to hasBoundaryBefore and so on), when it should be passing codepoints to these methods.
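        A small stdlib sketch (illustrative only, not the actual patch) of the char-vs-codepoint distinction Robert describes; supplementary characters occupy two chars, so a char-by-char loop would hand each surrogate half to the normalizer separately:

        ```java
        public class CodePointIteration {
            public static void main(String[] args) {
                // "𝄞" (U+1D11E, MUSICAL SYMBOL G CLEF) is one codepoint
                // stored as two chars (a surrogate pair).
                String s = "a\uD834\uDD1Eb";
                int codePoints = 0;
                for (int i = 0; i < s.length(); ) {
                    int cp = s.codePointAt(i);    // joins a surrogate pair into one int
                    codePoints++;
                    i += Character.charCount(cp); // advances by 2 for supplementary codepoints
                }
                System.out.println(s.length() + " chars, " + codePoints + " codepoints"); // 4 chars, 3 codepoints
            }
        }
        ```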

        David Goldfarb added a comment -

        Indeed, changing the code to iterate over codepoints fixed the majority of the test failures.

        The random tests still fail sometimes; I believe there's a bug in Normalizer2. I submitted a bug report here.

        David Goldfarb added a comment -

        Is this considered blocked on ICU? Let me know if there's anything else I can contribute to this.

        Robert Muir added a comment -

        Hi David: I haven't taken a look at the impact of the ICU bug (I'm not really that familiar with the incremental normalization API), but it seems rather serious.

        Is it possible to avoid using hasBoundaryAfter? In addition to the bug you found, it carries the warning that it may be significantly slower than hasBoundaryBefore; I'm wondering if we can dodge it.

        David Goldfarb added a comment -

        This patch dodges the use of hasBoundaryAfter, and the tests pass.

        Note that in doTestMode there's a clause that checks whether the normalized string has length zero. It seems the nfkc_cf-normalized output of some strings is empty. Examples I found:
        '\uDB40\uDCD9'
        '\uDB43\uDF86'
        '\uFE04'

        Robert Muir added a comment -

        Thanks so much for attacking this, David: I think the 0-length "all default ignorables" case makes sense (where it produces an empty string), because in that case there won't be a single token at all (MockTokenizer is not a perfect emulator of KeywordTokenizer here).

        I think this patch is close, but when running the test a few hundred times I hit a failure (see my added testCuriousString, which fails). I think this one is a bug in the logic.

        Motivated by this failure, I tried to beef up the tests in general:

        • fixed my typo where testNFD wasn't actually testing NFD
        • tested strings longer than 20 characters, since this filter has an internal 128-char buffer

        The latter seems to expose a lot of bugs, I assume due to the internal buffering. I haven't yet looked into this. But it seems there are correctness issues for documents longer than 128 chars (as well as what I believe is a separate bug seen by testCuriousString, which I think is some bug in the logic related to ignorables).

        Robert Muir added a comment -

        OK, as for the testCuriousString bug: I enabled verbose output (ant test -Dtestcase=TestICUNormalizer2CharFilter -Dtestmethod=testCuriousString -Dtests.verbose=true), and it seems to always fail when given a "spoon-fed" Reader. So I'll dig into this one; I think it involves how this CharFilter consumes the Reader API.

        Robert Muir added a comment -

        One thing that certainly looks like a bug is this:

        The input-processing side looks like this in pseudocode:

        while (read() some char[]s) {
           normalize(char[]s) // (quick check/hasBoundary/etc)
        }
        

        But read() works at the char level, and these normalization APIs want int codepoints.
        So I think readInputToBuffer() needs to keep reading, if possible, to ensure it fully consumes whole codepoints before returning. I added a little hack locally, but it didn't seem to clean up the test failures, so I think there are other bugs too, or I'm missing something?

          private int readInputToBuffer() throws IOException {
            final int len = input.read(tmpBuffer);
            if (len == -1) {
              inputFinished = true;
              return 0;
            }
            inputBuffer.append(tmpBuffer, 0, len);
            // nocommit: just a hack
            // if buffer ends on high surrogate, keep reading before processing
            if (len > 0 && Character.isHighSurrogate(tmpBuffer[len-1])) {
              return len + readInputToBuffer();
            }
            // end hack
            return len;
          }
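        A minimal stdlib sketch of the failure mode this hack guards against (illustrative only, not part of the patch): a fixed-size read can stop between the two halves of a surrogate pair, leaving the buffer ending mid-codepoint.

        ```java
        import java.io.Reader;
        import java.io.StringReader;

        public class SurrogateSplitDemo {
            public static void main(String[] args) throws Exception {
                // "𝄞" (U+1D11E) is encoded as the surrogate pair \uD834 \uDD1E.
                Reader reader = new StringReader("a\uD834\uDD1E");
                char[] buf = new char[2];   // a tiny buffer forces the split
                int len = reader.read(buf); // reads 'a' plus only the HIGH surrogate
                // Normalizing the buffer as-is would corrupt the character,
                // hence the "keep reading past a trailing high surrogate" check.
                boolean split = len == 2 && Character.isHighSurrogate(buf[len - 1]);
                System.out.println(split);  // true
            }
        }
        ```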
        
        David Goldfarb added a comment - edited

        Attaching a new patch - testCuriousString still fails.

        You're right about readInputToBuffer. I think we also have to stop only on normalization boundaries. I see two options:
        • use normalizer.hasBoundaryAfter(tmpBuffer[len-1]) (straightforward), or
        • use normalizer.hasBoundaryBefore(tmpBuffer[len-1]) together with mark() and reset().

          private int readInputToBuffer() throws IOException {
            final int len = input.read(tmpBuffer);
            if (len == -1) {
              inputFinished = true;
              return 0;
            }
            inputBuffer.append(tmpBuffer, 0, len);
            if (len >= 2 && normalizer.hasBoundaryAfter(tmpBuffer[len-1]) && !Character.isHighSurrogate(tmpBuffer[len-1])) {
                return len;
            } else return len + readInputToBuffer();
          }
        

        [edit]
        And the len >= 2 clause wasn't meant to be part of the patch; ignore that.

        if (normalizer.hasBoundaryAfter(tmpBuffer[len-1]) && !Character.isHighSurrogate(tmpBuffer[len-1])) {
                return len;
        } else return len + readInputToBuffer();
        
        David Goldfarb added a comment -

        Attaching a new patch. All tests pass.

        I'm using Normalizer2.isInert to check whether we need to keep reading into the input buffer, since it doesn't return false positives, even though it's not as fast as hasBoundaryBefore().

        Robert Muir added a comment -

        Whew, thank you!

        I did some minor cleanup: I toned down the tests I had added that were very slow (added a multiplier, so they will do more work on Jenkins), added testMassiveLigature (just to test the case where normalization increases the length), and removed the stuff around reset()... since mark() isn't supported, the default UOE is the right thing.

        I'll commit shortly.

        ASF subversion and git services added a comment -

        Commit 1579488 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1579488 ]

        LUCENE-4072: add ICUNormalizer2CharFilter

        ASF subversion and git services added a comment -

        Commit 1579491 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1579491 ]

        LUCENE-4072: add ICUNormalizer2CharFilter

        Robert Muir added a comment -

        Thank you Ippei UKAI and David Goldfarb!

        ASF subversion and git services added a comment -

        Commit 1579498 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1579498 ]

        LUCENE-4072: add factory

        ASF subversion and git services added a comment -

        Commit 1579499 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1579499 ]

        LUCENE-4072: add factory

        Uwe Schindler added a comment -

        Close issue after release of 4.8.0


          People

          • Assignee:
            Unassigned
            Reporter:
            Ippei UKAI
          • Votes:
            0
            Watchers:
            4

            Dates

            • Created:
              Updated:
              Resolved:

              Development