[LUCENE-1826] All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.9
Fix Version/s: 2.9
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

I have a TokenStream implementation that concatenates multiple sub-TokenStreams. (I then do additional filtering on top of the result, so I can't just have the indexer do the merging.)

In 2.4, this worked fine: once one sub-stream was exhausted, I just started using the next stream.

In 2.9, however, this is very difficult and requires copying the term buffer of every token being aggregated.

If all the sub-TokenStreams share the same AttributeSource, and my "concat" TokenStream shares that AttributeSource as well, this goes back to being very simple (and very efficient).

So, for example, I would like to see the following constructor added to StandardTokenizer:

  public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
    super(source);
    ...
  }

I would likewise want similar constructors added to all Tokenizer subclasses provided by Lucene.
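
To illustrate why a shared attribute holder makes the concatenation cheap, here is a minimal, self-contained Java sketch. It deliberately does not use Lucene: `SharedTerm`, `WordStream`, and `ConcatStream` are hypothetical stand-ins invented for this example (roughly playing the roles of an AttributeSource, a sub-Tokenizer, and the "concat" TokenStream). Because every sub-stream writes into the same holder, the concat stream only has to delegate and never copies a token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ConcatSketch {

    // Hypothetical stand-in for shared attribute state (NOT Lucene's AttributeSource).
    static final class SharedTerm {
        final StringBuilder buffer = new StringBuilder();
    }

    // Hypothetical sub-stream: each call writes the next token into the shared buffer.
    static final class WordStream {
        private final SharedTerm term;
        private final Iterator<String> words;

        WordStream(SharedTerm term, String... words) {
            this.term = term;
            this.words = Arrays.asList(words).iterator();
        }

        boolean incrementToken() {
            if (!words.hasNext()) {
                return false; // this sub-stream is exhausted
            }
            term.buffer.setLength(0);
            term.buffer.append(words.next());
            return true;
        }
    }

    // Hypothetical "concat" stream: delegates to each sub-stream in turn.
    // It never touches the token itself, since the state is already shared.
    static final class ConcatStream {
        private final Iterator<WordStream> subs;
        private WordStream current;

        ConcatStream(List<WordStream> subs) {
            this.subs = subs.iterator();
            this.current = this.subs.hasNext() ? this.subs.next() : null;
        }

        boolean incrementToken() {
            while (current != null) {
                if (current.incrementToken()) {
                    return true;
                }
                // Current sub-stream is exhausted, move on to the next one.
                current = subs.hasNext() ? subs.next() : null;
            }
            return false;
        }
    }

    // Consumes the concatenated stream, reading each token from the shared holder.
    static List<String> demo() {
        SharedTerm term = new SharedTerm();
        ConcatStream concat = new ConcatStream(Arrays.asList(
                new WordStream(term, "quick", "brown"),
                new WordStream(term), // an empty sub-stream is skipped
                new WordStream(term, "fox")));
        List<String> seen = new ArrayList<>();
        while (concat.incrementToken()) {
            seen.add(term.buffer.toString());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the real API, the equivalent sharing would presumably be set up by passing one AttributeSource to each Tokenizer through a constructor like the one proposed above; the sketch only models the delegation pattern, not Lucene's actual classes.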

Attachments

lucene-1826.patch
21/Aug/09 23:09
26 kB
Michael Busch

People

Assignee: Michael Busch

Reporter: Tim Smith

Dates

Created: 20/Aug/09 17:38

Updated: 28/Aug/22 12:06

Resolved: 23/Aug/09 08:34