Lucene - Core
  1. Lucene - Core
  2. LUCENE-2142

FieldCache.getStringIndex should not throw exception if term count exceeds doc count

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.4, 3.0.3, 3.1, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff of LUCENE-2133/LUCENE-831.

      Currently FieldCache cannot handle more than one value per field.
      We may someday want to fix that... but until that day:

      FieldCache.getStringIndex currently does a simplistic check to try to
      catch when you've accidentally allowed more than one term per field,
      by testing if the number of unique terms exceeds the number of
      documents.

      The problem is, this is not a perfect check, in that it allows false
      negatives (you could have more than one term per field for some docs
      and the check won't catch you).

      Further, the exception thrown is the unchecked RuntimeException.

      So this means... you could happily think all is good, until some day,
      well into production, once you've updated enough docs, suddenly the
      check will catch you and throw an unhandled exception, stopping all
      searches [that need to sort by this string field] in their tracks.
      It's not gracefully degrading.

      I think we should simply remove the test, ie, if you have more terms
      than docs then the terms simply overwrite one another.

      1. LUCENE-2142-fix-3x.patch
        2 kB
        Uwe Schindler
      2. LUCENE-2142-fix-trunk.patch
        1 kB
        Uwe Schindler
      3. LUCENE-2142-trunk.patch
        3 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          Patch to fix trunk to stop loading when number of terms > maxDoc.

          Show
          Michael McCandless added a comment - Patch to fix trunk to stop loading when number of terms > maxDoc.
          Hide
          Uwe Schindler added a comment -

          Committed in trunk, 3x, 3.0, 2.9 branches.

          Trunk is still missing the escape-branch when term count exceeds doc count.

          Show
          Uwe Schindler added a comment - Committed in trunk, 3x, 3.0, 2.9 branches. Trunk is still missing the escape-branch when term count exceeds doc count.
          Hide
          Michael McCandless added a comment -

          Thanks Uwe!

          So your fix avoids any exception altogether. On 3x, you just stop
          loading when we hit a termOrd > number of docs. On trunk, we keep
          loading, simply growing the array as needed.

          I'm torn on what the best disposition here is. This API should only
          be used on single-token (per doc) fields, so this handling we're
          adding/fixing is about how to handle the misuse of the API.

          Neither solution is great – throwing an exception is nasty since you
          could be fine for some time and then only on indexing enough docs,
          perhaps well into production, trip the exception. But then silently
          pretending nothing is wrong is also not great because the app then has
          no clue.

          Really this'd be a great time to use a logging framework – we'd drop
          a error, and then not throw an exception.

          Net/net I think your solution (don't throw an exception) is the lesser
          evil at this time, so I think we should go with that.

          But: I think we should also fix trunk? Ie, if hit termOrd > numDocs,
          silently break, instead of trying to grow the array. Because now (on
          trunk) if you try to load a DocTermsIndex on a large tokenized text
          field in a large index you'll (try to) use insane amounts of memory...

          Show
          Michael McCandless added a comment - Thanks Uwe! So your fix avoids any exception altogether. On 3x, you just stop loading when we hit a termOrd > number of docs. On trunk, we keep loading, simply growing the array as needed. I'm torn on what the best disposition here is. This API should only be used on single-token (per doc) fields, so this handling we're adding/fixing is about how to handle the misuse of the API. Neither solution is great – throwing an exception is nasty since you could be fine for some time and then only on indexing enough docs, perhaps well into production, trip the exception. But then silently pretending nothing is wrong is also not great because the app then has no clue. Really this'd be a great time to use a logging framework – we'd drop a error, and then not throw an exception. Net/net I think your solution (don't throw an exception) is the lesser evil at this time, so I think we should go with that. But: I think we should also fix trunk? Ie, if hit termOrd > numDocs, silently break, instead of trying to grow the array. Because now (on trunk) if you try to load a DocTermsIndex on a large tokenized text field in a large index you'll (try to) use insane amounts of memory...
          Hide
          Uwe Schindler added a comment -

          Here patch with test for 3.x and before. Trunk patch only contains test, which passes.

          Show
          Uwe Schindler added a comment - Here patch with test for 3.x and before. Trunk patch only contains test, which passes.
          Hide
          Uwe Schindler added a comment -

          After a coffee i have seen the problem, too - stupoid

          Here is the fix for 3.x (also 3.0 and 2.9) - in trunk the fix is not needed, as there are growable arrays. Maybe we should add a simple test to all branches!

          Show
          Uwe Schindler added a comment - After a coffee i have seen the problem, too - stupoid Here is the fix for 3.x (also 3.0 and 2.9) - in trunk the fix is not needed, as there are growable arrays. Maybe we should add a simple test to all branches!
          Hide
          Uwe Schindler added a comment -

          When does it throw AIOOBE (have not tested it...)? For me the code looked fine, too, so I don't think Mike was smoking something.

          Show
          Uwe Schindler added a comment - When does it throw AIOOBE (have not tested it...)? For me the code looked fine, too, so I don't think Mike was smoking something.
          Hide
          Michael McCandless added a comment -

          I think the remove still throws unhandled exception (AIOOBE)?

          Duh, right!

          I'm not sure what I was smoking when I did this... the fix makes the exception worse since you now get a cryptic AIOOBE instead of a RuntimeException explaining what's wrong.

          Show
          Michael McCandless added a comment - I think the remove still throws unhandled exception (AIOOBE)? Duh, right! I'm not sure what I was smoking when I did this... the fix makes the exception worse since you now get a cryptic AIOOBE instead of a RuntimeException explaining what's wrong.
          Hide
          Koji Sekiguchi added a comment -

          I think we should simply remove the test, ie, if you have more terms
          than docs then the terms simply overwrite one another.

          I think the remove still throws unhandled exception (AIOOBE)?

          Show
          Koji Sekiguchi added a comment - I think we should simply remove the test, ie, if you have more terms than docs then the terms simply overwrite one another. I think the remove still throws unhandled exception (AIOOBE)?
          Hide
          Michael McCandless added a comment -

          backport

          Show
          Michael McCandless added a comment - backport
          Hide
          Earwin Burrfoot added a comment -

          +1

          Show
          Earwin Burrfoot added a comment - +1
          Hide
          Yonik Seeley added a comment -

          +1

          Show
          Yonik Seeley added a comment - +1

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development