Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If offsets go negative or backwards, it can corrupt the index with DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS: the offsets will have wrong values (different from the term vectors) or even crazy values like -2147483645

      The problem with this is that its not just theoretical: its too easy to do this with lucene's own analyzer chains (e.g. ngramtokenizer).

      See issues such as LUCENE-3920 and some discussion on LUCENE-3738

      The question is how to fix this, e.g. should we:

      1. start enforcing that offsets cannot be crazy values in OffsetAttributeImpl/IndexWriter and fix the broken analyzers
      2. leave offsets as a pair of opaque integers, declaring this a limitation of the current codec, and either workaround or throw UOE from the postings writer.

        Attachments

        1. LUCENE-4127_offsetAtt.patch
          4 kB
          Robert Muir
        2. LUCENE-4127_test.patch
          2 kB
          Robert Muir
        3. LUCENE-4127.patch
          17 kB
          Robert Muir
        4. LUCENE-4127.patch
          6 kB
          Michael McCandless

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              rcmuir Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: