Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If offsets go negative or backwards, it can corrupt the index with DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS: the offsets will have wrong values (different from the term vectors) or even crazy values like -2147483645

      The problem with this is that its not just theoretical: its too easy to do this with lucene's own analyzer chains (e.g. ngramtokenizer).

      See issues such as LUCENE-3920 and some discussion on LUCENE-3738

      The question is how to fix this, e.g. should we:

      1. start enforcing that offsets cannot be crazy values in OffsetAttributeImpl/IndexWriter and fix the broken analyzers
      2. leave offsets as a pair of opaque integers, declaring this a limitation of the current codec, and either workaround or throw UOE from the postings writer.
      1. LUCENE-4127_offsetAtt.patch
        4 kB
        Robert Muir
      2. LUCENE-4127_test.patch
        2 kB
        Robert Muir
      3. LUCENE-4127.patch
        17 kB
        Robert Muir
      4. LUCENE-4127.patch
        6 kB
        Michael McCandless

        Activity

        Hide
        Robert Muir added a comment -

        Here are some dead-simple tests.

        Show
        Robert Muir added a comment - Here are some dead-simple tests.
        Hide
        Robert Muir added a comment -

        just my opinion, I tend towards #1 here. I think offsets should be strongly typed (they are the character offsets),
        and we should enforce this so that highlighting actually works.

        if people want a place to dump opaque data per-position, they can just use the payload for that.

        Show
        Robert Muir added a comment - just my opinion, I tend towards #1 here. I think offsets should be strongly typed (they are the character offsets), and we should enforce this so that highlighting actually works. if people want a place to dump opaque data per-position, they can just use the payload for that.
        Hide
        Michael McCandless added a comment -

        +1 to strongly type offsets.

        Show
        Michael McCandless added a comment - +1 to strongly type offsets.
        Hide
        Michael McCandless added a comment -

        I think we should also strongly check posIncr coming into IndexWriter ... attached patch does that and fixes a couple tests that were sending posInc=0 for first token.

        Show
        Michael McCandless added a comment - I think we should also strongly check posIncr coming into IndexWriter ... attached patch does that and fixes a couple tests that were sending posInc=0 for first token.
        Hide
        Robert Muir added a comment -

        +1! i feel like we should be able to eventually refactor/cleanup docinverter after this (with the silly correction removed),
        but I think this should happen under another issue!

        Show
        Robert Muir added a comment - +1! i feel like we should be able to eventually refactor/cleanup docinverter after this (with the silly correction removed), but I think this should happen under another issue!
        Hide
        Robert Muir added a comment -

        Here's a patch adding the stateless checks to OffsetAttributeImpl. I think its a good step, and tests pass.

        Show
        Robert Muir added a comment - Here's a patch adding the stateless checks to OffsetAttributeImpl. I think its a good step, and tests pass.
        Hide
        Robert Muir added a comment -

        Here's a patch: I think its committable (e.g. so we can get alpha release out).

        As a followup I think we should enable the docinverter check when termVectorOffsets are enabled, enable the backwards-offsets check in BaseTokenStreamTestCase, fix the broken analyzers, and improve the tests some more.

        Show
        Robert Muir added a comment - Here's a patch: I think its committable (e.g. so we can get alpha release out). As a followup I think we should enable the docinverter check when termVectorOffsets are enabled, enable the backwards-offsets check in BaseTokenStreamTestCase, fix the broken analyzers, and improve the tests some more.
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        Robert Muir added a comment -

        This was already committed (never marked resolved)

        Show
        Robert Muir added a comment - This was already committed (never marked resolved)

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development