  1. Lucene - Core
  2. LUCENE-985

AIOOB thrown when length of termText is longer than 16384 characters (ArrayIndexOutOfBoundsException)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

      Description

      DocumentsWriter has a max term length of 16384 characters; if a term exceeds it, you
      get an unfriendly ArrayIndexOutOfBoundsException. We should fix this to raise a clearer exception.

      1. LUCENE-985.patch
        3 kB
        Michael McCandless

        Activity

        hossman Hoss Man added a comment -

        As a clarification point for people who stumble upon this issue years from now after encountering whatever exception we put in place of the current one...

        why is there a max termText length?

        hossman Hoss Man added a comment -

        (making summary longer to improve searchability of the exception for other people who may get bit by it)

        mikemccand Michael McCandless added a comment -

        > As a clarification point for people who stumble upon this issue
        > years from now after encountering whatever exception we put in place
        > of the current one...why is there a max termText length?

        This is because DocumentsWriter packs the term text for each unique
        term seen into a pool of char[] blocks of 16384 chars each (to avoid
        GC overhead of each separate String). So, every time a new term is
        seen, it puts it at the end of the current block; when there's not
        enough space it allocates another block from the pool. So a given
        term must fit entirely into a single block.
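
        The packing scheme described above can be sketched roughly as follows. This is a minimal illustration, not the actual DocumentsWriter code: the class name CharBlockPoolSketch, its methods, and the choice of IllegalArgumentException are all assumptions made for the example, and it assumes a term of exactly 16384 chars still fits in a block.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a char[] block pool; not Lucene's actual DocumentsWriter code. */
final class CharBlockPoolSketch {
    /** Block size taken from the comment above. */
    static final int BLOCK_SIZE = 16384;

    private final List<char[]> blocks = new ArrayList<>();
    /** Next free position in the current (last) block. */
    private int upto = 0;

    /**
     * Copies the term text to the end of the current block, allocating a new
     * block when the term does not fit in the remaining space.
     * Returns a global offset: blockIndex * BLOCK_SIZE + position in block.
     */
    int add(String term) {
        int len = term.length();
        if (len > BLOCK_SIZE) {
            // A term must fit entirely into a single block, so fail with a
            // descriptive message (assumed exception type) instead of letting
            // the array copy below throw ArrayIndexOutOfBoundsException.
            throw new IllegalArgumentException(
                "term length " + len + " exceeds max term length " + BLOCK_SIZE);
        }
        if (blocks.isEmpty() || upto + len > BLOCK_SIZE) {
            // Not enough space left: start a fresh block.
            blocks.add(new char[BLOCK_SIZE]);
            upto = 0;
        }
        char[] block = blocks.get(blocks.size() - 1);
        term.getChars(0, len, block, upto);
        int start = (blocks.size() - 1) * BLOCK_SIZE + upto;
        upto += len;
        return start;
    }

    /** Number of blocks allocated so far. */
    int blockCount() {
        return blocks.size();
    }
}
```

        In this sketch the unused tail of a block is simply left empty when the next term does not fit, which is why no term can span two blocks and why the block size doubles as the term length limit.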

        karl.wettin Karl Wettin added a comment -

        I doubt anyone will have a problem with the limit. And if they hit the exception it is probably due to bad end-user input of some kind. I always run a token filter that leaves out any token larger than 250 characters or so, depending on the application. (It was quite accidental that I hit this AIOOBE.)

        That would also be a recommendation I think makes sense in the documentation people will look up when hitting the exception.
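
        As a rough illustration of that kind of filtering, here is a plain-Java sketch that drops overlong tokens before they reach the indexer. It deliberately does not use Lucene's TokenFilter API; the class LongTokenDropper and its method are hypothetical names for this example, and the 250 cutoff is just the figure mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; plain Java, not a Lucene TokenFilter. */
final class LongTokenDropper {
    /** Cutoff from the comment above ("250 characters or so"). */
    static final int MAX_TOKEN_LENGTH = 250;

    /** Returns a new list containing only tokens of at most maxLength chars. */
    static List<String> dropLongTokens(List<String> tokens, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLength) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

        With a cutoff this far below 16384, tokens produced by bad input are discarded long before they can hit the term length limit.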

        mikemccand Michael McCandless added a comment -

        > I doubt anyone will have a problem with the limit. And if they hit
        > the exception it is probably due to bad end-user input of some
        > kind. I always run a token filter that leaves out any token larger
        > than 250 characters or so, depending on the application. (It was
        > quite accidental that I hit this AIOOBE.)

        Agreed!

        > That would also be a recommendation I think makes sense in the
        > documentation people will look up when hitting the exception.

        I've added a blurb in javadoc for IndexWriter.addDocument explaining
        this limit.

        Thanks for catching this, Karl!


          People

          • Assignee: mikemccand Michael McCandless
          • Reporter: mikemccand Michael McCandless
          • Votes: 0
          • Watchers: 1
