Lucene - Core
  1. Lucene - Core
  2. LUCENE-6682

StandardTokenizer performance bug: buffer is unnecessarily copied when maxTokenLength doesn't change

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      From Piotr Idzikowski on java-user mailing list http://markmail.org/message/af26kr7fermt2tfh:

      I am developing own analyzer based on StandardAnalyzer.
      I realized that tokenizer.setMaxTokenLength is called many times.

      protected TokenStreamComponents createComponents(final String fieldName,
      final Reader reader) {
          final StandardTokenizer src = new StandardTokenizer(getVersion(),
      reader);
          src.setMaxTokenLength(maxTokenLength);
          TokenStream tok = new StandardFilter(getVersion(), src);
          tok = new LowerCaseFilter(getVersion(), tok);
          tok = new StopFilter(getVersion(), tok, stopwords);
          return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) throws IOException {
              src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
              super.setReader(reader);
            }
          };
        }
      

      Does it make sense if length stays the same? I see it finally calls this
      one( in StandardTokenizerImpl ):

      public final void setBufferSize(int numChars) {
           ZZ_BUFFERSIZE = numChars;
           char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
           System.arraycopy(zzBuffer, 0, newZzBuffer, 0,
      Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
           zzBuffer = newZzBuffer;
         }
      

      So it just copies old array content into the new one.

        Activity

        Hide
        Steve Rowe added a comment - - edited

        In setMaxTokenLength() we should call setBufferSize() only if the length has changed.

        Also, setMaxTokenLength() maxes out the buffer size at 1M chars (i.e. UTF-16 code units), but getMaxTokenLength() will lie and report the requested length, even though it exceeded 1M chars, and the buffer length, maxed out at 1M chars, is what really controls max token length. Not cool. We should instead throw an exception when the requested maxTokenLength exceeds 1M.

        Here's a patch that fixes both:

        Index: lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
        ===================================================================
        --- lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java	(revision 1691570)
        +++ lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java	(working copy)
        @@ -88,6 +88,8 @@
             "<HANGUL>"
           };
           
        +  public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;
        +  
           private int skippedPositions;
         
           private int maxTokenLength = StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH;
        @@ -97,9 +99,13 @@
           public void setMaxTokenLength(int length) {
             if (length < 1) {
               throw new IllegalArgumentException("maxTokenLength must be greater than zero");
        +    } else if (length > MAX_TOKEN_LENGTH_LIMIT) {
        +      throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
             }
        -    this.maxTokenLength = length;
        -    scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
        +    if (length != maxTokenLength) {
        +      maxTokenLength = length;
        +      scanner.setBufferSize(length);
        +    }
           }
         
           /** @see #setMaxTokenLength */
        
        Show
        Steve Rowe added a comment - - edited In setMaxTokenLength() we should call setBufferSize() only if the length has changed. Also, setMaxTokenLength() maxes out the buffer size at 1M chars (i.e. UTF-16 code units), but getMaxTokenLength() will lie and report the requested length, even though it exceeded 1M chars, and the buffer length, maxed out at 1M chars, is what really controls max token length. Not cool. We should instead throw an exception when the requested maxTokenLength exceeds 1M. Here's a patch that fixes both: Index: lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java =================================================================== --- lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java (revision 1691570) +++ lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java (working copy) @@ -88,6 +88,8 @@ "<HANGUL>" }; + public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024; + private int skippedPositions; private int maxTokenLength = StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH; @@ -97,9 +99,13 @@ public void setMaxTokenLength( int length) { if (length < 1) { throw new IllegalArgumentException( "maxTokenLength must be greater than zero" ); + } else if (length > MAX_TOKEN_LENGTH_LIMIT) { + throw new IllegalArgumentException( "maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT); } - this .maxTokenLength = length; - scanner.setBufferSize( Math .min(length, 1024 * 1024)); // limit buffer size to 1M chars + if (length != maxTokenLength) { + maxTokenLength = length; + scanner.setBufferSize(length); + } } /** @see #setMaxTokenLength */
        Hide
        Robert Muir added a comment -

        +1

        Show
        Robert Muir added a comment - +1
        Hide
        ASF subversion and git services added a comment -

        Commit 1691604 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1691604 ]

        LUCENE-6682: StandardTokenizer performance bug: scanner buffer is unnecessarily copied when maxTokenLength doesn't change. Also stop silently maxing out buffer size (and effectively also max token length) at 1M chars, but instead throw an exception from setMaxTokenLength() when the given length is greater than 1M chars.

        Show
        ASF subversion and git services added a comment - Commit 1691604 from Steve Rowe in branch 'dev/trunk' [ https://svn.apache.org/r1691604 ] LUCENE-6682 : StandardTokenizer performance bug: scanner buffer is unnecessarily copied when maxTokenLength doesn't change. Also stop silently maxing out buffer size (and effectively also max token length) at 1M chars, but instead throw an exception from setMaxTokenLength() when the given length is greater than 1M chars.
        Hide
        ASF subversion and git services added a comment -

        Commit 1691605 from Steve Rowe in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1691605 ]

        LUCENE-6682: StandardTokenizer performance bug: scanner buffer is unnecessarily copied when maxTokenLength doesn't change. Also stop silently maxing out buffer size (and effectively also max token length) at 1M chars, but instead throw an exception from setMaxTokenLength() when the given length is greater than 1M chars. (merged trunk r1691604)

        Show
        ASF subversion and git services added a comment - Commit 1691605 from Steve Rowe in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1691605 ] LUCENE-6682 : StandardTokenizer performance bug: scanner buffer is unnecessarily copied when maxTokenLength doesn't change. Also stop silently maxing out buffer size (and effectively also max token length) at 1M chars, but instead throw an exception from setMaxTokenLength() when the given length is greater than 1M chars. (merged trunk r1691604)
        Hide
        Steve Rowe added a comment -

        Committed to trunk and branch_5x.

        Thanks for reporting, Piotr!

        Show
        Steve Rowe added a comment - Committed to trunk and branch_5x. Thanks for reporting, Piotr!
        Hide
        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

        Show
        Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Steve Rowe
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development