  1. Lucene - Core
  2. LUCENE-6682

StandardTokenizer performance bug: buffer is unnecessarily copied when maxTokenLength doesn't change

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      From Piotr Idzikowski on the java-user mailing list (http://markmail.org/message/af26kr7fermt2tfh):

      I am developing my own analyzer based on StandardAnalyzer.
      I noticed that tokenizer.setMaxTokenLength is called many times.

      protected TokenStreamComponents createComponents(final String fieldName,
                                                       final Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
        src.setMaxTokenLength(maxTokenLength);
        TokenStream tok = new StandardFilter(getVersion(), src);
        tok = new LowerCaseFilter(getVersion(), tok);
        tok = new StopFilter(getVersion(), tok, stopwords);
        return new TokenStreamComponents(src, tok) {
          @Override
          protected void setReader(final Reader reader) throws IOException {
            src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
            super.setReader(reader);
          }
        };
      }
      

      Does that make sense if the length stays the same? I see that it ultimately
      calls this method (in StandardTokenizerImpl):

      public final void setBufferSize(int numChars) {
        ZZ_BUFFERSIZE = numChars;
        char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
        System.arraycopy(zzBuffer, 0, newZzBuffer, 0,
                         Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
        zzBuffer = newZzBuffer;
      }
      

      So on every call it allocates a new array and copies the old array's contents
      into it, even when the size has not changed.
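
      One way to avoid the redundant copy described above is to return early when the
      requested size equals the current one. The following is a minimal sketch of that
      guard, not the change committed for this issue: GuardedBuffer and its allocation
      counter are hypothetical names introduced only for illustration, and the class
      stands in for the buffer handling in StandardTokenizerImpl rather than reproducing it.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the scanner's buffer handling; this is not Lucene code. */
final class GuardedBuffer {
    private char[] buffer;
    private int allocations = 1; // counts the initial allocation in the constructor

    GuardedBuffer(int initialSize) {
        buffer = new char[initialSize];
    }

    /** Resizes the buffer only when the requested size differs from the current one. */
    void setBufferSize(int numChars) {
        if (numChars == buffer.length) {
            return; // same size: nothing to allocate or copy
        }
        // Arrays.copyOf truncates or zero-pads, like the Math.min-bounded arraycopy above.
        buffer = Arrays.copyOf(buffer, numChars);
        allocations++;
    }

    int size() {
        return buffer.length;
    }

    int allocations() {
        return allocations;
    }

    public static void main(String[] args) {
        GuardedBuffer b = new GuardedBuffer(255);
        // Mimics setReader() being invoked repeatedly with an unchanged maxTokenLength.
        for (int i = 0; i < 1000; i++) {
            b.setBufferSize(255);
        }
        b.setBufferSize(512); // a real change still reallocates
        System.out.println(b.allocations() + " allocations, size " + b.size());
    }
}
```

      The same check could equally sit one level up, in the caller that forwards the
      maximum token length to the scanner; where it belongs is a design choice this
      sketch does not settle.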

          People

            sarowe Steven Rowe