Lucene - Core
LUCENE-969

Optimize the core tokenizers/analyzers & deprecate Token.termText


    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: modules/analysis
    • Labels:
    • Lucene Fields:
      New, Patch Available


      There is some "low hanging fruit" for optimizing the core tokenizers
      and analyzers:

      • Re-use a single Token instance during indexing instead of creating
        a new one for every term. To do this, I added a new method "Token
        next(Token result)" (Doron's suggestion), which means the
        TokenStream may use the passed-in "Token result" as the returned
        Token, but is not required to (i.e., it can still return an
        entirely different Token if that is more convenient). I added
        default implementations for both next() methods in TokenStream so
        that a subclass can choose to implement only one of them.
      • Use "char[] termBuffer" in Token instead of the "String
        termText" field.

      Token now maintains a char[] termBuffer for holding the term's
      text. Tokenizers & filters should retrieve this buffer and
      directly alter it to put the term text in or to change the term
      text.
      I only deprecated the termText() method. I still allow the ctors
      that pass in String termText, as well as setTermText(String), but
      added a NOTE about performance cost of using these methods. I
      think it's OK to keep these as convenience methods?

      After the next release, when we can remove the deprecated API, we
      should clean up to no longer maintain "either String or
      char[]" (and the initTermBuffer() private method) and always use
      the char[] termBuffer instead.
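The reuse pattern above can be sketched as follows. This is a minimal stand-in, not the actual Lucene classes: the simplified Token below mimics the described char[] termBuffer with a grow-on-demand accessor (method names like resizeTermBuffer and setTermLength follow the pattern described in this issue, but the real signatures may differ), so that one Token instance can be refilled per term with no new String or Token allocations.

```java
// Minimal stand-in sketch (NOT the real Lucene Token) illustrating the
// char[] termBuffer reuse idea: refill one buffer per term instead of
// allocating a new String/Token for every term.
class Token {
    private char[] termBuffer = new char[16];
    private int termLength;

    // Grow-on-demand accessor; preserves existing term text on resize.
    char[] resizeTermBuffer(int newSize) {
        if (termBuffer.length < newSize) {
            char[] newBuffer = new char[Math.max(newSize, termBuffer.length * 2)];
            System.arraycopy(termBuffer, 0, newBuffer, 0, termLength);
            termBuffer = newBuffer;
        }
        return termBuffer;
    }

    void setTermLength(int length) { termLength = length; }

    // toString() built from the buffer, matching the toString() fix
    // mentioned in this issue.
    @Override
    public String toString() { return new String(termBuffer, 0, termLength); }
}

public class TermBufferDemo {
    public static void main(String[] args) {
        Token reusable = new Token();
        for (String t : new String[] {"quick", "brown", "fox"}) {
            // Refill the same buffer per term -- no new Token, no new String
            // stored in the token itself.
            char[] buf = reusable.resizeTermBuffer(t.length());
            t.getChars(0, t.length(), buf, 0);
            reusable.setTermLength(t.length());
            System.out.println(reusable);
        }
    }
}
```

The key design point is that callers mutate the buffer in place; the only per-term cost is copying characters, not object allocation.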

      • Re-use TokenStream instances across Fields & Documents instead of
        creating a new one for each doc. To do this I added an optional
        "reusableTokenStream(...)" to Analyzer which just defaults to
        calling tokenStream(...), and then I implemented this for the core
        analyzers.
      I'm using the patch from LUCENE-967 for benchmarking just
      tokenization.
      The changes above give 21% speedup (742 seconds -> 585 seconds) for
      LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain, tokenizing
      all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian Linux, RAID 5
      IO system (best of 2 runs).

      If I pre-break Wikipedia docs into 100 token docs then it's 37% faster
      (1236 sec -> 774 sec), I think because of re-using TokenStreams across
      documents.
      I'm just running with this alg and recording the elapsed time:


      {ReadTokens > : *

      See this thread for discussion leading up to this:

      I also fixed Token.toString() to work correctly when termBuffer is
      used (and added unit test).

      1. LUCENE-969.patch
        58 kB
        Michael McCandless
      2. LUCENE-969.take2.patch
        62 kB
        Michael McCandless


        Change history:
        • Michael McCandless created issue
        • Michael McCandless added attachment LUCENE-969.patch [ 12362766 ]
        • Michael McCandless changed Status: Open [ 1 ] → In Progress [ 3 ]
        • Michael McCandless changed Lucene Fields: [New] → [New, Patch Available]
        • Steve Rowe changed Lucene Fields: [Patch Available, New] → [New, Patch Available]
        • Michael McCandless added attachment LUCENE-969.take2.patch [ 12363010 ]
        • Michael McCandless changed Status: In Progress [ 3 ] → Resolved [ 5 ], Resolution: Fixed [ 1 ]
        • Michael Busch changed Status: Resolved [ 5 ] → Closed [ 6 ]
        • Mark Thomas changed Workflow: jira [ 12409585 ] → Default workflow, editable Closed status [ 12562652 ]
        • Mark Thomas changed Workflow: Default workflow, editable Closed status [ 12562652 ] → jira [ 12583592 ]


          • Assignee: Michael McCandless
          • Votes: 0