[CASSANDRA-14247] SASI tokenizer for simple delimiter based entries - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 3.11.3, 4.0-alpha1, 4.0
Component/s: Feature/SASI
Labels: sasi

Description

Currently SASI offers only two tokenizer options:

NonTokenizingAnalyzer
StandardAnalyzer

The latter is built on Snowball, which is powerful for human languages but overkill for simple tokenization.

A simple delimiter-based tokenizer is proposed here. The need for it arose as a workaround for CASSANDRA-11182, and to avoid the explosion in disk usage that comes from having to resort to CONTAINS mode. See https://github.com/openzipkin/zipkin/issues/1861

Example use of this would be:

CREATE CUSTOM INDEX span_annotation_query_idx 
    ON zipkin2.span (annotation_query) USING 'org.apache.cassandra.index.sasi.SASIIndex' 
    WITH OPTIONS = {
        'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.DelimiterAnalyzer', 
        'delimiter': '░',
        'case_sensitive': 'true', 
        'mode': 'prefix', 
        'analyzed': 'true'};
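
To illustrate what delimiter-based tokenization means for a column indexed this way, here is a minimal Java sketch. It is not the DelimiterAnalyzer source from the patch: the class name `DelimiterTokenizerSketch`, the `tokenize` method, the sample column value, and the choice to skip empty tokens are all assumptions made for illustration. The idea is only that the column value is split on the configured delimiter and each resulting term can then be indexed on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: this is NOT Cassandra's DelimiterAnalyzer.
// Class and method names here are hypothetical.
final class DelimiterTokenizerSketch {

    // Splits the input on a single delimiter character.
    // Skipping empty tokens is an assumption of this sketch.
    static List<String> tokenize(String input, char delimiter) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= input.length(); i++) {
            // A token ends at a delimiter or at the end of the input.
            if (i == input.length() || input.charAt(i) == delimiter) {
                if (i > start) {
                    tokens.add(input.substring(start, i));
                }
                start = i + 1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+2591 is the delimiter character from the index options above.
        char delimiter = '\u2591';
        // Hypothetical annotation_query value, made up for this example.
        String value = "error\u2591http.method=GET\u2591http.path=/api";
        for (String token : tokenize(value, delimiter)) {
            System.out.println(token);
        }
    }
}
```

Compared with a stemming analyzer, a split like this does no language processing at all, which is the point of the proposal: the terms are exactly the substrings between delimiters.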

Original credit for this work goes to https://github.com/zuochangan

People

Assignee: Michael Semb Wever
Reporter: Michael Semb Wever
Authors: Michael Semb Wever
Reviewers: Michael Kjellman
Votes: 0
Watchers: 6

Dates

Created: 21/Feb/18 10:13
Updated: 15/May/20 07:59
Resolved: 15/Mar/18 09:46