  1. Lucene - Core
  2. LUCENE-4127

Negative offsets/deltas corruption

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

    Description

      If offsets go negative or backwards, they can corrupt an index that uses DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS: the stored offsets will have wrong values (different from the term vectors) or even crazy values like -2147483645.
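
      To illustrate why backwards offsets are dangerous, here is a minimal sketch assuming start offsets are stored as deltas from the previous start offset. The class and method names are hypothetical, and this does not reproduce Lucene's actual postings format; it only shows that a token whose offset moves backwards yields a negative delta, which a writer expecting non-negative values cannot store faithfully.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene code.
final class OffsetDeltaSketch {

    /** Returns each start offset as the difference from the previous one (the first is relative to 0). */
    static List<Integer> deltas(int[] startOffsets) {
        List<Integer> out = new ArrayList<>();
        int last = 0;
        for (int start : startOffsets) {
            out.add(start - last); // negative when offsets go backwards
            last = start;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(deltas(new int[] {0, 4, 9})); // [0, 4, 5]
        System.out.println(deltas(new int[] {0, 4, 2})); // [0, 4, -2]
    }
}
```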

      This is not just a theoretical problem: it is too easy to trigger with Lucene's own analyzer chains (e.g. NGramTokenizer).

      See issues such as LUCENE-3920 and some discussion on LUCENE-3738.

      The question is how to fix this, e.g. should we:

      1. start enforcing that offsets cannot be crazy values in OffsetAttributeImpl/IndexWriter, and fix the broken analyzers
      2. leave offsets as a pair of opaque integers, declare this a limitation of the current codec, and either work around it or throw UnsupportedOperationException (UOE) from the postings writer.
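
      As a rough sketch of what option 1 could look like, the check below rejects negative offsets, an end offset before the start offset, and start offsets that go backwards. OffsetChecker and its method are hypothetical names for illustration; they are not part of Lucene, and the exact rules enforced by the eventual fix may differ.

```java
// Hypothetical illustration only; not Lucene's OffsetAttributeImpl or IndexWriter.
final class OffsetChecker {
    private int lastStartOffset = 0;

    /** Validates one token's offsets against the previous token's start offset in the same field. */
    void check(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "startOffset must be non-negative and endOffset must be >= startOffset; got "
                    + startOffset + "," + endOffset);
        }
        if (startOffset < lastStartOffset) {
            throw new IllegalArgumentException(
                "offsets must not go backwards: startOffset=" + startOffset
                    + " is < lastStartOffset=" + lastStartOffset);
        }
        lastStartOffset = startOffset;
    }
}
```

      Failing fast at this point would surface a broken analyzer when the document is indexed, rather than leaving bad values in the postings.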

      Attachments

        1. LUCENE-4127.patch
          6 kB
          Michael McCandless
        2. LUCENE-4127.patch
          17 kB
          Robert Muir
        3. LUCENE-4127_test.patch
          2 kB
          Robert Muir
        4. LUCENE-4127_offsetAtt.patch
          4 kB
          Robert Muir

        Activity

          People

            Assignee: Unassigned
            Reporter: Robert Muir (rcmuir)
            Votes: 0
            Watchers: 1