Lucene - Core
  1. Lucene - Core
  2. LUCENE-2142

FieldCache.getStringIndex should not throw exception if term count exceeds doc count

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.4, 3.0.3, 3.1, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff of LUCENE-2133/LUCENE-831.

      Currently FieldCache cannot handle more than one value per field.
      We may someday want to fix that... but until that day:

      FieldCache.getStringIndex currently does a simplistic check to try to
      catch when you've accidentally allowed more than one term per field,
      by testing if the number of unique terms exceeds the number of
      documents.

      The problem is, this is not a perfect check, in that it allows false
      negatives (you could have more than one term per field for some docs
      and the check won't catch you).

      Further, the exception thrown is the unchecked RuntimeException.

      So this means... you could happily think all is good, until some day,
      well into production, once you've updated enough docs, suddenly the
      check will catch you and throw an unhandled exception, stopping all
      searches [that need to sort by this string field] in their tracks.
      It's not gracefully degrading.

      I think we should simply remove the test, ie, if you have more terms
      than docs then the terms simply overwrite one another.

      1. LUCENE-2142-fix-3x.patch
        2 kB
        Uwe Schindler
      2. LUCENE-2142-fix-trunk.patch
        1 kB
        Uwe Schindler
      3. LUCENE-2142-trunk.patch
        3 kB
        Michael McCandless

        Issue Links

          Activity

          Michael McCandless created issue -
          Hide
          Yonik Seeley added a comment -

          +1

          Show
          Yonik Seeley added a comment - +1
          Michael McCandless made changes -
          Field Original Value New Value
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Earwin Burrfoot added a comment -

          +1

          Show
          Earwin Burrfoot added a comment - +1
          Hide
          Michael McCandless added a comment -

          backport

          Show
          Michael McCandless added a comment - backport
          Michael McCandless made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Michael McCandless made changes -
          Fix Version/s 2.9.3 [ 12314799 ]
          Michael McCandless made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Fix Version/s 3.0.2 [ 12314798 ]
          Fix Version/s 3.1 [ 12314822 ]
          Resolution Fixed [ 1 ]
          Hide
          Koji Sekiguchi added a comment -

          I think we should simply remove the test, ie, if you have more terms
          than docs then the terms simply overwrite one another.

          I think the remove still throws unhandled exception (AIOOBE)?

          Show
          Koji Sekiguchi added a comment - I think we should simply remove the test, ie, if you have more terms than docs then the terms simply overwrite one another. I think the remove still throws unhandled exception (AIOOBE)?
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hide
          Michael McCandless added a comment -

          I think the remove still throws unhandled exception (AIOOBE)?

          Duh, right!

          I'm not sure what I was smoking when I did this... the fix makes the exception worse since you now get a cryptic AIOOBE instead of a RuntimeException explaining what's wrong.

          Show
          Michael McCandless added a comment - I think the remove still throws unhandled exception (AIOOBE)? Duh, right! I'm not sure what I was smoking when I did this... the fix makes the exception worse since you now get a cryptic AIOOBE instead of a RuntimeException explaining what's wrong.
          Michael McCandless made changes -
          Resolution Fixed [ 1 ]
          Status Closed [ 6 ] Reopened [ 4 ]
          Hide
          Uwe Schindler added a comment -

          When does it throw AIOOBE (have not tested it...)? For me the code looked fine, too, so I don't think Mike was smoking something.

          Show
          Uwe Schindler added a comment - When does it throw AIOOBE (have not tested it...)? For me the code looked fine, too, so I don't think Mike was smoking something.
          Hide
          Uwe Schindler added a comment -

          After a coffee i have seen the problem, too - stupoid

          Here is the fix for 3.x (also 3.0 and 2.9) - in trunk the fix is not needed, as there are growable arrays. Maybe we should add a simple test to all branches!

          Show
          Uwe Schindler added a comment - After a coffee i have seen the problem, too - stupoid Here is the fix for 3.x (also 3.0 and 2.9) - in trunk the fix is not needed, as there are growable arrays. Maybe we should add a simple test to all branches!
          Uwe Schindler made changes -
          Attachment LUCENE-2142-fix.patch [ 12447525 ]
          Hide
          Uwe Schindler added a comment -

          Here patch with test for 3.x and before. Trunk patch only contains test, which passes.

          Show
          Uwe Schindler added a comment - Here patch with test for 3.x and before. Trunk patch only contains test, which passes.
          Uwe Schindler made changes -
          Attachment LUCENE-2142-fix-3x.patch [ 12447527 ]
          Attachment LUCENE-2142-fix-trunk.patch [ 12447528 ]
          Uwe Schindler made changes -
          Attachment LUCENE-2142-fix.patch [ 12447525 ]
          Hide
          Michael McCandless added a comment -

          Thanks Uwe!

          So your fix avoids any exception altogether. On 3x, you just stop
          loading when we hit a termOrd > number of docs. On trunk, we keep
          loading, simply growing the array as needed.

          I'm torn on what the best disposition here is. This API should only
          be used on single-token (per doc) fields, so this handling we're
          adding/fixing is about how to handle the misuse of the API.

          Neither solution is great – throwing an exception is nasty since you
          could be fine for some time and then only on indexing enough docs,
          perhaps well into production, trip the exception. But then silently
          pretending nothing is wrong is also not great because the app then has
          no clue.

          Really this'd be a great time to use a logging framework – we'd drop
          a error, and then not throw an exception.

          Net/net I think your solution (don't throw an exception) is the lesser
          evil at this time, so I think we should go with that.

          But: I think we should also fix trunk? Ie, if hit termOrd > numDocs,
          silently break, instead of trying to grow the array. Because now (on
          trunk) if you try to load a DocTermsIndex on a large tokenized text
          field in a large index you'll (try to) use insane amounts of memory...

          Show
          Michael McCandless added a comment - Thanks Uwe! So your fix avoids any exception altogether. On 3x, you just stop loading when we hit a termOrd > number of docs. On trunk, we keep loading, simply growing the array as needed. I'm torn on what the best disposition here is. This API should only be used on single-token (per doc) fields, so this handling we're adding/fixing is about how to handle the misuse of the API. Neither solution is great – throwing an exception is nasty since you could be fine for some time and then only on indexing enough docs, perhaps well into production, trip the exception. But then silently pretending nothing is wrong is also not great because the app then has no clue. Really this'd be a great time to use a logging framework – we'd drop a error, and then not throw an exception. Net/net I think your solution (don't throw an exception) is the lesser evil at this time, so I think we should go with that. But: I think we should also fix trunk? Ie, if hit termOrd > numDocs, silently break, instead of trying to grow the array. Because now (on trunk) if you try to load a DocTermsIndex on a large tokenized text field in a large index you'll (try to) use insane amounts of memory...
          Uwe Schindler made changes -
          Fix Version/s 2.9.4 [ 12315148 ]
          Fix Version/s 3.0.3 [ 12315147 ]
          Fix Version/s 3.0.2 [ 12314798 ]
          Fix Version/s 2.9.3 [ 12314799 ]
          Hide
          Uwe Schindler added a comment -

          Committed in trunk, 3x, 3.0, 2.9 branches.

          Trunk is still missing the escape-branch when term count exceeds doc count.

          Show
          Uwe Schindler added a comment - Committed in trunk, 3x, 3.0, 2.9 branches. Trunk is still missing the escape-branch when term count exceeds doc count.
          Hide
          Michael McCandless added a comment -

          Patch to fix trunk to stop loading when number of terms > maxDoc.

          Show
          Michael McCandless added a comment - Patch to fix trunk to stop loading when number of terms > maxDoc.
          Michael McCandless made changes -
          Attachment LUCENE-2142-trunk.patch [ 12447940 ]
          Michael McCandless made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hoss Man made changes -
          Link This issue is related to SOLR-2249 [ SOLR-2249 ]
          Hoss Man made changes -
          Link This issue breaks SOLR-2339 [ SOLR-2339 ]
          Mark Thomas made changes -
          Workflow jira [ 12484361 ] Default workflow, editable Closed status [ 12564161 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12564161 ] jira [ 12583913 ]
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Resolved Resolved
          18h 6m 1 Michael McCandless 11/Dec/09 11:38
          Resolved Resolved Reopened Reopened
          170d 50m 1 Michael McCandless 30/May/10 13:29
          Closed Closed Reopened Reopened
          14h 9m 1 Michael McCandless 18/Jun/10 23:13
          Reopened Reopened Resolved Resolved
          7d 16h 49m 2 Michael McCandless 24/Jun/10 13:10
          Resolved Resolved Closed Closed
          176d 19h 22m 2 Uwe Schindler 01/Dec/10 14:49

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development