Lucene - Core
  LUCENE-969

Optimize the core tokenizers/analyzers & deprecate Token.termText

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      There is some "low hanging fruit" for optimizing the core tokenizers
      and analyzers:

      • Re-use a single Token instance during indexing instead of creating
        a new one for every term. To do this, I added a new method "Token
        next(Token result)" (Doron's suggestion) which means TokenStream
        may use the "Token result" as the returned Token, but is not
        required to (ie, can still return an entirely different Token if
        that is more convenient). I added default implementations for
        both next() methods in TokenStream.java so that a TokenStream can
        choose to implement only one of the next() methods.
      • Use "char[] termBuffer" in Token instead of the "String
        termText".

      Token now maintains a char[] termBuffer for holding the term's
      text. Tokenizers & filters should retrieve this buffer and
      directly alter it to put the term text in or change the term
      text.

      I only deprecated the termText() method. I still allow the ctors
      that pass in String termText, as well as setTermText(String), but
      added a NOTE about the performance cost of using these methods. I
      think it's OK to keep these as convenience methods?

      After the next release, when we can remove the deprecated API, we
      should clean up Token.java to no longer maintain "either String or
      char[]" (and the initTermBuffer() private method) and always use
      the char[] termBuffer instead.
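
      The reuse pattern above can be sketched with a minimal, self-contained
      model. These are simplified stand-in classes, not the real Lucene
      types: the actual 2.3 Token and TokenStream have more methods, and a
      real tokenizer would not copy the whole input per term. The point is
      the shape of "Token next(Token result)" plus in-place termBuffer use.

```java
// Simplified stand-ins (NOT the real Lucene classes) illustrating
// "Token next(Token result)" reuse and the char[] termBuffer.
class Token {
  private char[] termBuffer = new char[32];
  private int termLength;

  char[] termBuffer() { return termBuffer; }
  int termLength() { return termLength; }

  void setTermBuffer(char[] src, int offset, int length) {
    if (length > termBuffer.length) termBuffer = new char[length];
    System.arraycopy(src, offset, termBuffer, 0, length);
    termLength = length;
  }

  public String toString() { return new String(termBuffer, 0, termLength); }
}

abstract class TokenStream {
  // Reusable variant: may fill in and return 'result', or a different Token.
  public abstract Token next(Token result);

  // Non-reusing variant, kept as a default so streams can implement
  // whichever is more convenient.
  public Token next() { return next(new Token()); }
}

class WhitespaceTokenizer extends TokenStream {
  private final String text;
  private int pos;

  WhitespaceTokenizer(String text) { this.text = text; }

  public Token next(Token result) {
    while (pos < text.length() && text.charAt(pos) == ' ') pos++;
    if (pos >= text.length()) return null;
    int start = pos;
    while (pos < text.length() && text.charAt(pos) != ' ') pos++;
    // write term text straight into the caller's buffer: no new
    // String or Token allocated per term
    result.setTermBuffer(text.toCharArray(), start, pos - start);
    return result;
  }
}

public class ReuseDemo {
  public static void main(String[] args) {
    TokenStream ts = new WhitespaceTokenizer("hello token reuse");
    Token reusable = new Token();  // one Token for the whole stream
    for (Token t = ts.next(reusable); t != null; t = ts.next(reusable)) {
      System.out.println(t);       // prints hello, token, reuse
    }
  }
}
```

      A filter in the chain would similarly call input.next(result) and then
      mutate result.termBuffer() in place rather than building a new String.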

      • Re-use TokenStream instances across Fields & Documents instead of
        creating a new one for each doc. To do this I added an optional
        "reusableTokenStream(...)" to Analyzer which just defaults to
        calling tokenStream(...), and then I implemented this for the core
        analyzers.
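
      The reusableTokenStream(...) idea can be sketched the same way. Again
      these are hypothetical stand-ins, not the real Lucene classes (the
      simplified stream returns term Strings directly, and the saved stream
      is a plain field rather than per-thread state), but they show the key
      move: the default delegates to tokenStream(...), while an overriding
      analyzer resets and returns a saved instance instead of allocating a
      new one per document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Simplified stand-ins (NOT the real Lucene classes) for stream reuse.
interface TokenStream {
  String next() throws IOException;  // returns next term's text, or null
}

class LowerCaseTokenizer implements TokenStream {
  private Reader input;

  LowerCaseTokenizer(Reader input) { this.input = input; }

  // reset() is what makes one instance reusable across documents
  void reset(Reader input) { this.input = input; }

  public String next() throws IOException {
    int c;
    do { c = input.read(); } while (c != -1 && Character.isWhitespace((char) c));
    if (c == -1) return null;
    StringBuilder sb = new StringBuilder();
    while (c != -1 && !Character.isWhitespace((char) c)) {
      sb.append(Character.toLowerCase((char) c));
      c = input.read();
    }
    return sb.toString();
  }
}

abstract class Analyzer {
  public abstract TokenStream tokenStream(String field, Reader reader);

  // Optional hook: default just delegates, so existing analyzers
  // need no changes to keep working.
  public TokenStream reusableTokenStream(String field, Reader reader) {
    return tokenStream(field, reader);
  }
}

class LowerCaseAnalyzer extends Analyzer {
  private LowerCaseTokenizer saved;  // reused across fields & docs

  public TokenStream tokenStream(String field, Reader reader) {
    return new LowerCaseTokenizer(reader);  // fresh instance every call
  }

  public TokenStream reusableTokenStream(String field, Reader reader) {
    if (saved == null) saved = new LowerCaseTokenizer(reader);
    else saved.reset(reader);               // reuse: no new tokenizer
    return saved;
  }
}

public class ReusableAnalyzerDemo {
  public static void main(String[] args) throws IOException {
    LowerCaseAnalyzer a = new LowerCaseAnalyzer();
    TokenStream s1 = a.reusableTokenStream("body", new StringReader("First DOC"));
    System.out.println(s1.next() + " " + s1.next());  // first doc
    TokenStream s2 = a.reusableTokenStream("body", new StringReader("Second DOC"));
    System.out.println(s2 == s1);                     // true: same instance
    System.out.println(s2.next() + " " + s2.next());  // second doc
  }
}
```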

      I'm using the patch from LUCENE-967 for benchmarking just
      tokenization.

      The changes above give 21% speedup (742 seconds -> 585 seconds) for
      LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain, tokenizing
      all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian Linux, RAID 5
      IO system (best of 2 runs).

      If I pre-break Wikipedia docs into 100 token docs then it's 37% faster
      (1236 sec -> 774 sec), I think because of re-using TokenStreams across
      docs.

      I'm just running with this alg and recording the elapsed time:

      analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
      doc.tokenize.log.step=50000
      docs.file=/lucene/wikifull.txt
      doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
      doc.tokenized=true
      doc.maker.forever=false

      {ReadTokens > : *

      See this thread for discussion leading up to this:

      http://www.gossamer-threads.com/lists/lucene/java-dev/51283

      I also fixed Token.toString() to work correctly when termBuffer is
      used (and added unit test).

      1. LUCENE-969.patch
        58 kB
        Michael McCandless
      2. LUCENE-969.take2.patch
        62 kB
        Michael McCandless

        Activity

        Michael McCandless created issue -
        Michael McCandless added a comment -

        First-cut patch. All tests pass. I still need to fix some javadocs
        but otherwise I think this is close...

        Michael McCandless made changes -
        Field Original Value New Value
        Attachment LUCENE-969.patch [ 12362766 ]
        Michael McCandless made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Michael McCandless made changes -
        Lucene Fields [New] [New, Patch Available]
        Michael Busch added a comment -

        Hi Mike,

        this is just an idea to keep Token.java simpler, but I haven't really thought about all the consequences, so feel free to tell me that it's a bad idea.

        Could you add a new class TermBuffer, containing the char[] array and your resize() logic, that would implement CharSequence? Then you could get rid of the duplicate constructors and setters for String and char[], because String also implements CharSequence. And CharSequence has the method charAt(int index), so it should be almost as fast as directly accessing the char array in case the TermBuffer is used. You would need to change the existing constructors and setter to take a CharSequence object instead of a String, but this is not an API change as users can still pass in a String object. And then you would just need to add a new constructor with offset and length and a similar setter. Thoughts?

        Yonik Seeley added a comment -

        > [...] implement CharSequence
        I think CharSequence is Java5

        Steve Rowe added a comment -

        CharSequence was introduced in 1.4:
        http://java.sun.com/j2se/1.4.2/docs/api/java/lang/CharSequence.html
        Steve Rowe made changes -
        Lucene Fields [Patch Available, New] [New, Patch Available]
        Michael McCandless added a comment -

        > Could you add a new class TermBuffer including the char[] array and
        > your resize() logic that would implement CharSequence? Then you
        > could get rid of the duplicate constructors and setters for String
        > and char[], because String also implements CharSequence. And
        > CharSequence has the method charAt(int index), so it should be
        > almost as fast as directly accessing the char array in case the
        > TermBuffer is used. You would need to change the existing
        > constructors and setter to take a CharSequence object instead of a
        > String, but this is not an API change as users can still pass in a
        > String object. And then you would just need to add a new constructor
        > with offset and length and a similar setter. Thoughts?

        If I understand this, consumers of the Token API would need to
        separately construct/reuse their own TermBuffer in order to then set
        the Token to new text? This could then slow down applications that
        still need to make a new Token instance for every term in their
        documents because now 2 class instances would be created for every
        token.

        Also I don't think this would make the public API simpler? People
        already understand String and char[] as normal ways to represent text
        content; if we add our own new class here that's another
        Lucene-specific way to represent text content that people will have to
        learn.

        Internally, Token looks more complex than it will be in the future,
        just because we need the initTermBuffer() calls until we can remove
        the deprecated attr (String termText) and method (termText()).

        I believe having String and char[] variants of text-processing APIs is
        fairly common practice and is reasonable. EG the PorterStemmer has 4
        "stem" methods (one accepting String and 3 accepting char[] with or
        without offset/length).
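
        The dual String/char[] pattern described here can be sketched as
        follows. This is an illustrative, hypothetical example (AsciiFolder
        is not a Lucene class, and the folding logic is deliberately
        trivial): a char[] core method that works in place on a slice, plus
        a String convenience wrapper that pays for a copy and delegates.

```java
// Hypothetical example of the common String/char[] dual-variant pattern
// (this class is NOT part of Lucene).
public class AsciiFolder {
  // core variant: mutates a char[] slice in place, no allocation
  public static void fold(char[] buf, int offset, int len) {
    for (int i = offset; i < offset + len; i++) {
      if (buf[i] == 'é') buf[i] = 'e';  // trivial stand-in folding rule
    }
  }

  // convenience variant: copies, delegates to the core, returns a String
  public static String fold(String s) {
    char[] buf = s.toCharArray();
    fold(buf, 0, buf.length);
    return new String(buf);
  }
}
```

        Callers on a hot path use the char[] variant against a reused
        buffer; casual callers take the String variant and accept the copy.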

        Michael Busch added a comment -

        > This could then slow down applications that still need to make a
        > new Token instance for every term in their documents because now
        > 2 class instances would be created for every token.

        Yes that's true. I was thinking that in the new optimized way, where
        people reuse the same Token and char[] instance, this wouldn't harm
        since TermBuffer would basically just be a wrapper around a char
        array. But you are right, this would be an overhead in apps that
        can't reuse the Tokens.

        > if we add our own new class here that's another
        > Lucene-specific way to represent text content that people will have to
        > learn.

        Agree. I was just thinking that the CharSequence approach would reduce the
        number of setters and constructors, but you're right, we're going to remove
        the ones that take Strings anyway in a future version.

        OK, the API of this patch looks good to me! +1
        Thanks for your detailed answer!

        Michael McCandless added a comment -

        Updated patch based on recent commits; fixed up the javadocs and a few
        other small things. I think this is ready to commit but I'll wait a
        few days for more comments...

        Michael McCandless made changes -
        Attachment LUCENE-969.take2.patch [ 12363010 ]
        Michael McCandless made changes -
        Lucene Fields [Patch Available, New] [New, Patch Available]
        Status In Progress [ 3 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Michael Busch made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Mark Thomas made changes -
        Workflow jira [ 12409585 ] Default workflow, editable Closed status [ 12562652 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12562652 ] jira [ 12583592 ]
        Transition                Time In Source Status   Execution Times   Last Executer        Last Execution Date
        Open -> In Progress       1m 52s                  1                 Michael McCandless   30/Jul/07 13:42
        In Progress -> Resolved   11d 23h 35m             1                 Michael McCandless   11/Aug/07 13:17
        Resolved -> Closed        166d 15h 6m             1                 Michael Busch        25/Jan/08 03:24

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            0
