Spinoff from SOLR-3684.
Most Lucene tokenizers have a fixed buffer size, e.g. in CharTokenizer/ICUTokenizer it's a modest char[] buffer.
But the JFlex tokenizers use a much larger char[] buffer by default, which seems overkill. I'm not sure we really see any performance benefit from having such a huge buffer size as a default.
There is a JFlex parameter to set this (the %buffer directive): I think we should consider reducing it.
In a configuration like Solr, tokenizers are reused per-thread-per-field,
so these buffers can easily stack up in RAM.
Additionally, CharFilters are not reused, so the buffer configuration in e.g.
HTMLStripCharFilter might not be great, since it's per-document garbage.
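
To make the "stack up in RAM" concern concrete, here is a rough back-of-the-envelope sketch. The thread count, field count, and both buffer sizes are illustrative assumptions, not measurements or values taken from this issue, and the estimate counts only the raw char data (2 bytes per Java char), ignoring array headers and any other per-tokenizer state.

```java
public class TokenizerBufferEstimate {
    // A Java char is a 2-byte UTF-16 code unit.
    static final int BYTES_PER_CHAR = 2;

    /**
     * Approximate bytes held by scan buffers when one tokenizer is kept
     * per thread per analyzed field. Counts char data only.
     */
    static long estimateBytes(int threads, int analyzedFields, int bufferChars) {
        return (long) threads * analyzedFields * bufferChars * BYTES_PER_CHAR;
    }

    public static void main(String[] args) {
        int threads = 100;        // hypothetical number of indexing/query threads
        int analyzedFields = 50;  // hypothetical number of analyzed fields
        int[] bufferSizes = {4096, 16384}; // hypothetical buffer sizes, in chars

        for (int bufferChars : bufferSizes) {
            long bytes = estimateBytes(threads, analyzedFields, bufferChars);
            System.out.printf("buffer=%d chars -> ~%.1f MiB retained%n",
                    bufferChars, bytes / (1024.0 * 1024.0));
        }
    }
}
```

Because the retained memory scales linearly with the buffer size, any reduction of the default translates directly into a proportional saving under these assumptions.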
|Status||Open [ 1 ]||Resolved [ 5 ]|
|Fix Version/s||4.0 [ 12322456 ]|
|Fix Version/s||5.0 [ 12321663 ]|
|Resolution||Fixed [ 1 ]|
|Status||Resolved [ 5 ]||Closed [ 6 ]|
|Transition||Time In Source Status||Execution Times||Last Executer||Last Execution Date|
|2h 23m||1||Robert Muir||06/Aug/12 18:37|
|276d 17h 3m||1||Uwe Schindler||10/May/13 11:40|