Apache Cassandra / CASSANDRA-14247

SASI tokenizer for simple delimiter-based entries

Details

    Description

      Currently SASI offers only two tokenizer options:

      • NonTokenizingAnalyzer
      • StandardAnalyzer

      The latter is built on Snowball, which is powerful for human languages but overkill for simple tokenization.

      A simple delimiter-based tokenizer is proposed here. The need for it arose as a workaround for CASSANDRA-11182, and to avoid the disk usage explosion that comes with having to resort to CONTAINS. See https://github.com/openzipkin/zipkin/issues/1861

      Example use of this would be:

      CREATE CUSTOM INDEX span_annotation_query_idx 
          ON zipkin2.span (annotation_query) USING 'org.apache.cassandra.index.sasi.SASIIndex' 
          WITH OPTIONS = {
              'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.DelimiterAnalyzer', 
              'delimiter': '░',
              'case_sensitive': 'true', 
              'mode': 'prefix', 
              'analyzed': 'true'};
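
To illustrate what delimiter-based tokenization means for a column such as annotation_query, here is a minimal conceptual sketch in Python. It is not the DelimiterAnalyzer implementation: the function names, the sample value, and the choice to drop empty tokens are assumptions made purely for illustration.

```python
def delimiter_tokens(value: str, delimiter: str = "░") -> list[str]:
    """Split a column value on the delimiter.

    Assumption for this sketch: empty tokens (e.g. from a leading or
    trailing delimiter) are dropped.
    """
    return [token for token in value.split(delimiter) if token]


def matches_prefix(value: str, prefix: str, delimiter: str = "░") -> bool:
    """Return True if any token starts with the prefix.

    The comparison is case-sensitive, mirroring 'case_sensitive': 'true'
    and 'mode': 'prefix' in the index options above.
    """
    return any(token.startswith(prefix)
               for token in delimiter_tokens(value, delimiter))


if __name__ == "__main__":
    # Hypothetical annotation_query value, entries separated by '░'.
    sample = "░error░http.method=GET░"
    print(delimiter_tokens(sample))
    print(matches_prefix(sample, "http.method"))
```

Because each entry becomes its own token, a prefix lookup can match the start of any entry without indexing every substring of the value, which is the cost the description attributes to CONTAINS.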
      

      Original credit for this work goes to https://github.com/zuochangan

          People

            mck Michael Semb Wever
            Michael Kjellman