Details
Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Fixed
Description
Currently SASI offers only two tokenizer options:
- NonTokenizingAnalyzer
- StandardAnalyzer
The latter is built on Snowball, which is powerful for human languages but overkill for simple tokenization.
A simple delimiter-based tokenizer is proposed here. The need for it arose as a workaround for CASSANDRA-11182, and to avoid the disk usage explosion that comes from having to resort to CONTAINS mode. See https://github.com/openzipkin/zipkin/issues/1861
Example use of this would be:
CREATE CUSTOM INDEX span_annotation_query_idx ON zipkin2.span (annotation_query)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.DelimiterAnalyzer',
    'delimiter': '░',
    'case_sensitive': 'true',
    'mode': 'prefix',
    'analyzed': 'true'
};
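To illustrate what the 'delimiter' option is meant to do, here is a minimal Java sketch of splitting a column value into terms on a single delimiter character. It is not the DelimiterAnalyzer implementation: the class name DelimiterTokenizerSketch, the tokenize helper, and the sample value are all hypothetical, and skipping empty tokens is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code in org.apache.cassandra.index.sasi.analyzer.
class DelimiterTokenizerSketch {

    // Splits the input on one delimiter character.
    // Empty tokens (leading, trailing, or doubled delimiters) are skipped;
    // that is an assumption of this sketch.
    static List<String> tokenize(String input, char delimiter) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= input.length(); i++) {
            if (i == input.length() || input.charAt(i) == delimiter) {
                if (i > start) {
                    tokens.add(input.substring(start, i));
                }
                start = i + 1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        char delimiter = '\u2591'; // the '░' character used in the index options
        // Made-up sample value in the style of a delimited annotation_query column.
        String value = "error\u2591http.method=GET\u2591sr";
        System.out.println(tokenize(value, delimiter)); // prints [error, http.method=GET, sr]
    }
}
```

Each resulting term would be indexed on its own, so with 'mode': 'prefix' the index only has to cover prefixes of each term rather than every substring of the whole column value.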
Original credit for this work goes to https://github.com/zuochangan