Lucene - Core
  1. Lucene - Core
  2. LUCENE-4221

CheckIndex is overeager for term vector offsets bounds checks

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-BETA, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In some situations (like running shingles twice), you end out with a case where startOffset > endOffset.

      We prevent this in IndexWriter for postings offsets, but we never do any validation here for term vectors (at some point, maybe we should make a plan to address this?)

      Anyway, currently CheckIndex will wrongly fail in this situation, which some of our own analyzers even do (e.g. LUCENE-3920)...

      This is an overly-eager validation in checkindex (for vectors, we cannot safely do these assertions as it was/is never enforced by IndexWriter, only for postings offsets).

      1. LUCENE-4221.patch
        8 kB
        Robert Muir
      2. LUCENE-4221.patch
        7 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          This patch disables all the offsets checks for term vectors.

          I'd like a plan to start enforcing this stuff in IndexWriter for term vectors as well so we can actually have these checks on at some point in the future. Sure maybe its annoying that things like ngrams violate all these rules and will fail if term vectors are on, but these are broken analyzers that need to be fixed and we shouldn't allow bogus data in the index.

          The problem with the current situation (besides checkindex), is if someone has such bogus offsets in an older index
          and they try to use something like Highlighter they will just trip errors from OffsetAttribute, etc. So they won't really work.

          Best idea i have so far:

          1. Fix LUCENE-4180 so that we can differentiate between 4.0-alpha and 4.0-beta indexes
          2. Change default term vectors merge impl to buffer one doc in RAM, if it has invalid offsets, clear the offsets bit and dont write them.
          3. Only enable bulk merge for 4.x codec, when the segment was written by 4.0-beta+, otherwise just call super.merge

          One downside is that we must keep the one-doc buffering (part 2) even in trunk until 6.x to support 4.0-alpha indexes, but its too late now.

          Show
          Robert Muir added a comment - This patch disables all the offsets checks for term vectors. I'd like a plan to start enforcing this stuff in IndexWriter for term vectors as well so we can actually have these checks on at some point in the future. Sure maybe its annoying that things like ngrams violate all these rules and will fail if term vectors are on, but these are broken analyzers that need to be fixed and we shouldn't allow bogus data in the index. The problem with the current situation (besides checkindex), is if someone has such bogus offsets in an older index and they try to use something like Highlighter they will just trip errors from OffsetAttribute, etc. So they won't really work. Best idea i have so far: Fix LUCENE-4180 so that we can differentiate between 4.0-alpha and 4.0-beta indexes Change default term vectors merge impl to buffer one doc in RAM, if it has invalid offsets, clear the offsets bit and dont write them. Only enable bulk merge for 4.x codec, when the segment was written by 4.0-beta+, otherwise just call super.merge One downside is that we must keep the one-doc buffering (part 2) even in trunk until 6.x to support 4.0-alpha indexes, but its too late now.
          Hide
          Michael McCandless added a comment -

          +1 to disabling the checks (the patch): CheckIndex should never fail just because analyzer was buggy.

          +1 also for that plan to remove bad offsets on merge.

          Show
          Michael McCandless added a comment - +1 to disabling the checks (the patch): CheckIndex should never fail just because analyzer was buggy. +1 also for that plan to remove bad offsets on merge.
          Hide
          Robert Muir added a comment -

          same patch with a standalone test

          Show
          Robert Muir added a comment - same patch with a standalone test
          Hide
          Robert Muir added a comment -

          The fix is already committed

          Show
          Robert Muir added a comment - The fix is already committed

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development