Lucene - Core
LUCENE-967

Add "tokenize documents only" task to contrib/benchmark

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: modules/benchmark
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      I've been looking at performance improvements to tokenization by
      re-using Tokens, and to help benchmark my changes I've added a new
      task called ReadTokens that just steps through all fields in a
      document, gets a TokenStream, and reads all the tokens out of it.

      E.g., this alg just reads all Tokens for all docs in the Reuters collection:

      doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
      doc.maker.forever=false
      {ReadTokens > : *
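
      In spirit, the task's inner loop steps through every field of each
      document, obtains a token stream, and drains it. Below is a minimal,
      self-contained sketch of that loop; it uses java.util.StringTokenizer
      as a stand-in for Lucene's TokenStream so it runs without Lucene on
      the classpath, and the class and method names are illustrative, not
      the patch's actual code.

      ```java
      import java.util.List;
      import java.util.StringTokenizer;

      public class ReadTokensSketch {

          /** Consume all tokens of every field; return the total token count. */
          static int readTokens(List<String> fields) {
              int count = 0;
              for (String field : fields) {                        // step through all fields
                  StringTokenizer ts = new StringTokenizer(field); // stand-in for a TokenStream
                  while (ts.hasMoreTokens()) {                     // read all the tokens out of it
                      ts.nextToken();  // token is discarded; only the tokenization cost matters
                      count++;
                  }
              }
              return count;
          }

          public static void main(String[] args) {
              System.out.println(readTokens(List.of("hello world", "foo bar baz"))); // 5
          }
      }
      ```

      Because the tokens are read and thrown away, the task isolates
      tokenization cost from indexing cost, which is what makes it useful
      for benchmarking Token re-use.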

      Attachments

        1. LUCENE-967.patch
          11 kB
          Michael McCandless
        2. LUCENE-967.take2.patch
          12 kB
          Michael McCandless
        3. LUCENE-967.take3.patch
          14 kB
          Michael McCandless

          People

            Assignee: Michael McCandless (mikemccand)
            Reporter: Michael McCandless (mikemccand)
            Votes: 0
            Watchers: 1
