Lucene - Core
  LUCENE-5472

Long terms should generate a RuntimeException, not just infoStream

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As reported on the solr-user list, when a term is greater than 2^15 bytes it is silently ignored at indexing time – a message is logged to the infoStream if enabled, but no error is thrown.

      Seems like we should change this behavior (if nothing else, starting in 5.0) to throw an exception.
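For context on the numbers involved: the actual limit (quoted later in this thread as DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8, exposed as IndexWriter.MAX_TERM_LENGTH) is 32766 bytes, slightly below the 2^15 figure in the description. A standalone plain-Java sketch (class and method names are hypothetical, not Lucene code) showing why the 32767-character value in the repro below trips it:

```java
import java.nio.charset.StandardCharsets;

// Standalone sketch (not Lucene code): IndexWriter.MAX_TERM_LENGTH in the
// 4.x codebase is 32766 bytes (BYTE_BLOCK_SIZE - 2), so the limit applies
// to the term's UTF-8 encoded length, not its character count.
public class TermLengthDemo {
  // Hypothetical constant mirroring DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8.
  static final int MAX_TERM_LENGTH_UTF8 = 32766;

  // Returns true if the term's UTF-8 encoding fits within the limit.
  static boolean fitsTermLimit(String term) {
    return term.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_LENGTH_UTF8;
  }

  public static void main(String[] args) {
    String ok = "aaa";
    String huge = "Z".repeat(32767);          // same value the repro below indexes
    System.out.println(fitsTermLimit(ok));    // true: 3 bytes
    System.out.println(fitsTermLimit(huge));  // false: 32767 bytes > 32766
  }
}
```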

      1. LUCENE-5472.patch
        19 kB
        Hoss Man
      2. LUCENE-5472.patch
        18 kB
        Varun Thacker
      3. LUCENE-5472.patch
        12 kB
        Michael McCandless
      4. LUCENE-5472.patch
        7 kB
        Hoss Man

        Activity

        Hoss Man added a comment -

        Easy steps to reproduce using the example configs...

        hossman@frisbee:~$ perl -le 'print "a,aaa"; print "z," . ("Z" x 32767);' | curl 'http://localhost:8983/solr/update?header=false&fieldnames=name,long_s&rowid=id&commit=true' -H 'Content-Type: application/csv' --data-binary @- 
        <?xml version="1.0" encoding="UTF-8"?>
        <response>
        <lst name="responseHeader"><int name="status">0</int><int name="QTime">572</int></lst>
        </response>
        hossman@frisbee:~$ curl 'http://localhost:8983/solr/select?q=*:*&fl=id,name&wt=json&indent=true'
        {
          "responseHeader":{
            "status":0,
            "QTime":12,
            "params":{
              "fl":"id,name",
              "indent":"true",
              "q":"*:*",
              "wt":"json"}},
          "response":{"numFound":2,"start":0,"docs":[
              {
                "name":"a",
                "id":"0"},
              {
                "name":"z",
                "id":"1"}]
          }}
        hossman@frisbee:~$ curl 'http://localhost:8983/solr/select?q=long_s:*&wt=json&indent=true'
        {
          "responseHeader":{
            "status":0,
            "QTime":1,
            "params":{
              "indent":"true",
              "q":"long_s:*",
              "wt":"json"}},
          "response":{"numFound":1,"start":0,"docs":[
              {
                "name":"a",
                "long_s":"aaa",
                "id":"0",
                "_version_":1459225819107819520}]
          }}
        
        Hoss Man added a comment -

        Things i'm confident of...

        • the limit is IndexWriter.MAX_TERM_LENGTH
        • it is not configurable
        • a message is written to the infoStream by DocFieldProcessor when this is exceeded...
              if (docState.maxTermPrefix != null && docState.infoStream.isEnabled("IW")) {
                docState.infoStream.message("IW", "WARNING: document contains at least one immense term (whose UTF8 encoding is longer than the max length " + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8 + "), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '" + docState.maxTermPrefix + "...'");
                docState.maxTermPrefix = null;
              }

        Things i think i understand, but am not certain of...

        • by the time DocumentsWriterPerThread sees this problem, and logs this to the infoStream, it's already too late to throw an exception up the call stack (because it's happening in another thread)

        Rough idea only half considered...

        • update the tokenstream producers in Solr to explicitly check the terms they are about to return and throw an exception if they exceed this length (mention using LengthFilter in this error message)
        • this wouldn't help if people use their own concrete Analyzer class – but it would solve the problem with things like StrField, or anytime analysis factories are used
        • we could conceivably wrap any user-configured concrete Analyzer class to do this check – but i'm not sure we should, since it will add cycles and the Analyzer might already be well behaved.

        thoughts?
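The check Hoss sketches above can be illustrated in plain Java (a real implementation would live in a Lucene TokenFilter or the analysis chain; the class name, the 32766-byte limit, and the message wording here are assumptions modeled on IndexWriter.MAX_TERM_LENGTH and the infoStream warning quoted earlier):

```java
import java.nio.charset.StandardCharsets;

// Plain-Java sketch of the proposed guard: reject a term whose UTF-8
// encoding exceeds the limit, naming the field and suggesting LengthFilter
// in the error message so users can find and fix the offending input.
public class TermLengthGuard {
  static final int MAX_TERM_LENGTH_UTF8 = 32766;

  // Throws IllegalArgumentException if the term is too long; otherwise returns it.
  static String checkTerm(String fieldName, String term) {
    int len = term.getBytes(StandardCharsets.UTF_8).length;
    if (len > MAX_TERM_LENGTH_UTF8) {
      throw new IllegalArgumentException(
          "Document contains at least one immense term in field=\"" + fieldName
          + "\" (" + len + " bytes > " + MAX_TERM_LENGTH_UTF8 + " max); "
          + "consider using LengthFilter to remove such terms. Prefix: '"
          + term.substring(0, Math.min(30, term.length())) + "...'");
    }
    return term;
  }
}
```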

        Michael McCandless added a comment -

        Actually, I think Lucene should just throw an exc when this happens? DWPT should be the right place (it isn't a different thread)...

        Separately this limit is absurdly large

        Hoss Man added a comment -

        I think Lucene should just throw an exc when this happens? ... (it isn't a different thread)

        I wasn't entirely sure about that – and since it currently writes to the infoStream but does not throw an exception, i assumed that was because of the threading.

        If you think we should convert this to a LUCENE issue and throw a RuntimeException i'm all for that.

        Michael McCandless added a comment -

        If you think we should convert this to a LUCENE issue and throw a RuntimeException i'm all for that.

        +1

        I don't think this leniency is good: it hides that your app is indexing trash.

        Hoss Man added a comment -

        revising summary & description in prep for converting to LUCENE issue

        Hoss Man added a comment -

        Here's a quick pass at trying to fix this along with a test.

        at the moment the test fails because i didn't see any immediately obvious way to get the fieldname into the exception message, and that seems kind of key to making it useful (yes, a byte prefix of the term is there, but for most people indexing text that's not going to immediately help them understand where to look for the long term)

        I haven't dug down deeper to see if it would be safe/easy to just add the fieldname to docState.maxTermPrefix (as a prefix on the prefix) nor have i run any other tests to see if throwing an exception here breaks any other existing tests that happen to depend on big ass terms being silently ignored.

        Hoss Man added a comment -

        nor have i run any other tests to see if throwing an exception here breaks any other existing tests that happen to depend on big ass terms being silently ignored.

        no surprises jumped out at me when running all tests, although there was an un-surprising failure from TestIndexWriter.testWickedLongTerm which asserts the exact opposite of the new test i added here: that a doc with a "wicked long" term can still be added.

        Assuming people think the overall change should happen, removing/fixing that test would be trivial.

        Michael McCandless added a comment -

        Thanks Hoss!

        I tweaked the patch a bit, to get the field name into the exc message, and moved where we throw the exception. I also fixed testWickedLongTerm to verify the exc is thrown, but other documents just indexed are not lost.

        Hoss Man added a comment -

        I tweaked the patch a bit

        Looks pretty good, although "maxTermField" in DocumentsWriterPerThread looks like a new unused variable ... cruft from a different approach you were going to take?

        Do you think this should be committed only to trunk, or trunk & 4x - or is it too much of a runtime behavior change to put on 4x?

        Can/should we make throwing an exception dependent on the Version used in the IndexWriterConfig? (it's not clear to me if enough IW context is passed down to DocInverterPerField currently to even access the Version)

        Robert Muir added a comment -

        I would prefer we just change it in trunk and 4.x, rather than something tricky.

        Michael McCandless added a comment -

        cruft from a different approach you were going to take?

        Woops, yes, nuke it please

        Do you think this should be committed only to trunk, or trunk & 4x - or is it too much of a runtime behavior change to put on 4x?

        I think a hard break here is fine; I think it's a service to our users if we notify them that they are indexing trash.

        Varun Thacker added a comment -

        Attached new patch -
        1. Removed unused variable in DWPT
        2. Added Solr Tests

        Hoss Man added a comment -

        I would prefer we just change it in trunk and 4.x, rather than something tricky.

        Works for me – we can note it in the backcompat section

        Added Solr Tests

        Thanks for writing those tests Varun!

        I've tweaked the patch slightly...

        • renamed Varun's Solr TestIndexing -> TestExceedMaxTermLength so it's more clear what we're testing
        • fixed a variable typo in the test (longFielValue -> longFieldValue)
        • fixed patch so that "ant precommit" would pass (String.format was being used w/o a Locale)
        • removed a stale nocommit comment i'd put in lucene's TestExceedMaxTermLength that Mike fixed a few patches ago
        • fixed the jdocs for IndexWriter.addDocument to refer to MAX_TERM_LENGTH (ironically, it already incorrectly said that terms which were too long would cause an IllegalArgumentException)

        I think this is ready? speak now or i'll commit tomorrow.

        ASF subversion and git services added a comment -

        Commit 1574595 from hossman@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1574595 ]

        LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH
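With this change, applications that previously relied on oversized terms being silently skipped must trim or drop them themselves (e.g. with LengthFilter, as the new exception message suggests). A hypothetical application-side helper, not part of Lucene, that truncates a value to a UTF-8 byte budget without splitting a multi-byte character:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical application-side helper (not part of Lucene): trims a string
// so its UTF-8 encoding fits a byte budget such as IndexWriter.MAX_TERM_LENGTH,
// stepping back past UTF-8 continuation bytes so no character is split.
public class TermTruncator {
  static String truncateUtf8(String s, int maxBytes) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    if (bytes.length <= maxBytes) {
      return s;
    }
    int end = maxBytes;
    // Continuation bytes have the bit pattern 10xxxxxx; back up until the
    // first excluded byte starts a new character.
    while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
      end--;
    }
    return new String(bytes, 0, end, StandardCharsets.UTF_8);
  }
}
```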

        ASF subversion and git services added a comment -

        Commit 1574637 from hossman@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1574637 ]

        LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH (merge r1574595)

        Hoss Man added a comment -

        thanks everybody!

        Uwe Schindler added a comment -

        Close issue after release of 4.8.0


          People

          • Assignee:
            Hoss Man
            Reporter:
            Hoss Man
          • Votes:
            0
            Watchers:
            4
