Cassandra / CASSANDRA-14247

SASI tokenizer for simple delimiter based entries

    Details

      Description

      Currently SASI offers only two tokenizer options:

      • NonTokenizingAnalyzer
      • StandardAnalyzer

      The latter is built on Snowball, which is powerful for human languages but overkill for simple tokenization.

      A simple delimiter-based tokenizer is proposed here. The need for it arose as a workaround for CASSANDRA-11182, and to avoid the explosion in disk usage that comes with having to resort to CONTAINS. See https://github.com/openzipkin/zipkin/issues/1861

      Example use of this would be:

      CREATE CUSTOM INDEX span_annotation_query_idx 
          ON zipkin2.span (annotation_query) USING 'org.apache.cassandra.index.sasi.SASIIndex' 
          WITH OPTIONS = {
              'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.DelimiterAnalyzer', 
              'delimiter': '░',
              'case_sensitive': 'true', 
              'mode': 'prefix', 
              'analyzed': 'true'};
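
      To make the intended behaviour concrete, here is a minimal Java sketch of delimiter-based tokenization. It is not the DelimiterAnalyzer code from the patch: the class name DelimiterTokenizerSketch, the tokenize helper, and the choice to skip empty tokens are all assumptions made for illustration. It only shows the idea that a column value is split on a single configured character and each resulting term is what gets indexed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not the analyzer shipped with Cassandra.
final class DelimiterTokenizerSketch {

    // Splits value on a single delimiter character.
    // Assumption: empty tokens (leading, trailing, or doubled delimiters) are skipped.
    static List<String> tokenize(String value, char delimiter) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= value.length(); i++) {
            boolean atEnd = i == value.length();
            if (atEnd || value.charAt(i) == delimiter) {
                if (i > start) {
                    tokens.add(value.substring(start, i));
                }
                start = i + 1;
            }
        }
        return tokens;
    }
}
```

      With the '░' delimiter from the index definition above, a value made of two entries separated by that character would yield two terms, each of which could then be matched on its own instead of scanning the whole string as CONTAINS does.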
      

      Original credit for this work goes to https://github.com/zuochangan

            People

            • Assignee: Michael Semb Wever (mck)
            • Reporter: Michael Semb Wever (mck)
            • Authors: Michael Semb Wever
            • Reviewers: Michael Kjellman