Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      A user reported this because invalid token offsets were causing his highlighting to throw an error.
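
      For illustration, here is a minimal sketch of the failing arithmetic described in the comments below (plain Java, no Lucene API; the chain and values are assumed from the æ -> ae example discussed there, not taken from the user's report): a folding filter lengthens a term without touching its offsets, and an edge n-gram filter that computes offsets additively then points past the end of the original text.

          // Original text is "æ": one char, so the folded token has offsets [0, 1).
          String original = "\u00E6";
          int tokStart = 0, tokEnd = 1;   // offsets of the folded token
          String term = "ae";             // term after folding: length 2 != tokEnd - tokStart
          for (int end = 1; end <= term.length(); end++) {
            int startOffset = tokStart;        // tokStart + 0
            int endOffset = tokStart + end;    // "additive" offset computation
            // For the gram "ae": endOffset == 2 > original.length() == 1, so a
            // highlighter slicing the source text with these offsets, e.g.
            // original.substring(startOffset, endOffset), throws
            // StringIndexOutOfBoundsException.
            System.out.println(term.substring(0, end) + " -> [" + startOffset + ", " + endOffset + ")");
          }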

      Attachments

      1. 6B2Uh.png (143 kB) - Robert Muir
      2. LUCENE-3642_test.patch (3 kB) - Robert Muir
      3. LUCENE-3642_ngrams.patch (9 kB) - Robert Muir
      4. LUCENE-3642.patch (19 kB) - Robert Muir
      5. LUCENE-3642.patch (20 kB) - Robert Muir
      6. LUCENE-3642.patch (25 kB) - Robert Muir

        Activity

        Robert Muir added a comment -

        Screenshot from the user.

        Robert Muir added a comment -

        I thought up a hackish way we can test for these invalid offsets for all filters... I'll see if it works.
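
        A sketch of what such a blanket check could look like (a hypothetical helper; the real check went into BaseTokenStreamTestCase and its exact form may differ): run any filter chain over an input and assert that every token's offsets are ordered and stay within the original text.

            import java.io.IOException;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

            // Helper for a test class: consume the stream and sanity-check offsets.
            static void assertSaneOffsets(TokenStream ts, int inputLength) throws IOException {
              OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
              ts.reset();
              while (ts.incrementToken()) {
                int start = offsetAtt.startOffset();
                int end = offsetAtt.endOffset();
                assert start >= 0 && start <= end : "negative/reversed offsets: " + start + "," + end;
                assert end <= inputLength : "endOffset " + end + " past end of input (" + inputLength + ")";
              }
              ts.end();
              ts.close();
            }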

        Robert Muir added a comment -

        Here's a test.

        The problem is that a previous filter 'lengthens' this term by folding æ -> ae, but EdgeNGramTokenFilter computes the offsets "additively": offsetAtt.setOffset(tokStart + start, tokStart + end);

        Because of this, if a word has been 'lengthened' by a previous filter, EdgeNGramTokenFilter will produce offsets that extend past the end of the original text (and probably bogus ones if it's been shortened).

        I think we should do what WDF does here: if the original offsets have already been changed (startOffset + termLength != endOffset), then we should simply preserve them for the new subwords.

        I added a check for this to BaseTokenStreamTestCase... now to see if anything else fails...
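
        A sketch of that WDF-style guard (variable names assumed from the snippet above; not the committed patch): detect that an earlier filter changed the term's length relative to its offsets, and in that case stop deriving gram offsets from term-local positions.

            // tokStart/tokEnd: the incoming token's offsets; termLength: the
            // current term's length; start/end: gram bounds within the term.
            boolean hasIllegalOffsets = (tokStart + termLength) != tokEnd;
            if (hasIllegalOffsets) {
              // Offsets no longer correspond to positions in the term text:
              // preserve the parent token's offsets for every gram.
              offsetAtt.setOffset(tokStart, tokEnd);
            } else {
              // Offsets still line up with the term text: shift the gram bounds.
              offsetAtt.setOffset(tokStart + start, tokStart + end);
            }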

        Robert Muir added a comment -

        So my assert trips for shit like WhitespaceTokenizer + LowerCaseFilter... how horrible is that?

        There must be offset bugs in CharTokenizer... I'll dig into it.

        Robert Muir added a comment -

        Here's a patch fixing the (edge) n-gram filters, using the same logic as WDF (it's well-defined; I think it's the only thing we can do here).

        Still need to fix the CharTokenizer bug, and also add some tests for any other "filters that are actually tokenizers" we might have.

        Max Beutel added a comment -

        Robert, that patch for the EdgeNGramTokenFilter worked. If any problems occur, I'll let you know. Thanks!

        Robert Muir added a comment -

        Thanks Max, I am currently adding more tests/fixes for other broken tokenizers/filters with offset bugs.

        I'll update the patch when these are passing, but I think the n-grams stuff is OK.

        Robert Muir added a comment -

        Updated patch with a test+fix for the smartcn (Smart Chinese) analyzer, and with a test for CharTokenizer... it currently fails with an off-by-one (incorrect startOffset), which is in turn jacking up the endOffsets too.
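
        A sketch of a stricter check that catches off-by-ones like this even when the offsets stay in bounds (a hypothetical helper, not necessarily what the patch adds): for a tokenizer that doesn't alter characters, the input slice at [startOffset, endOffset) must equal the term text exactly.

            import java.io.IOException;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
            import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

            // Helper for a test class: verify each token's offsets point back at
            // exactly the characters it was tokenized from.
            static void assertOffsetsMatchInput(TokenStream ts, String input) throws IOException {
              CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
              OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
              ts.reset();
              while (ts.incrementToken()) {
                String slice = input.substring(offsetAtt.startOffset(), offsetAtt.endOffset());
                assert slice.equals(termAtt.toString())
                    : "term '" + termAtt + "' != input slice '" + slice + "'";
              }
              ts.end();
              ts.close();
            }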

        Robert Muir added a comment -

        Here's the fix for CharTokenizer.

        Tests are passing, I will commit soon.

        Robert Muir added a comment -

        Just looking, I see another bug in CharTokenizer... I'll add another test.

        Robert Muir added a comment -

        Patch with tests and a fix for the additional bug in CharTokenizer.


          People

          • Assignee: Robert Muir
          • Reporter: Robert Muir
          • Votes: 5
          • Watchers: 5

            Dates

            • Created:
            • Updated:
            • Resolved:
