  1. Cassandra
  2. CASSANDRA-11122

SASI does not find term when indexing non-ascii character

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 3.4
    • Component/s: Legacy/CQL
    • Labels: None
    • Environment: Cassandra 3.4 SNAPSHOT
    • Severity: Normal

    Description

      I built the snapshot version taken from here: https://github.com/xedin/cassandra/tree/CASSANDRA-11067

      I created a tiny music dataset containing non-ASCII (Cyrillic) characters and created a SASI index on the artist name.

      SASI finds the row with the Cyrillic artist name but, strangely, returns nothing for a plain ASCII name ('Object').

      CREATE KEYSPACE music WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;
      
      CREATE TABLE music.albums (
          title text PRIMARY KEY,
          artist text
      );
      
      INSERT INTO music.albums(artist,title) VALUES('Object','The Reflecting Skin');
      INSERT INTO music.albums(artist,title) VALUES('Hayden','Mild and Hazy');
      INSERT INTO music.albums(artist,title) VALUES('Самое Большое Простое Число','СБПЧ Оркестр');
      
      CREATE custom INDEX on music.albums(artist) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = { 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false'};
      
      SELECT * FROM music.albums;
      
      
      title               | artist
      ---------------------+-----------------------------
       The Reflecting Skin |                      Object
             Mild and Hazy |                      Hayden
              СБПЧ Оркестр | Самое Большое Простое Число
      
      (3 rows)
      
      SELECT * FROM music.albums WHERE artist='Самое Большое Простое Число';
      
      title               | artist
      ---------------------+-----------------------------
              СБПЧ Оркестр | Самое Большое Простое Число
      
      (1 rows)
      
      SELECT * FROM music.albums WHERE artist='Hayden';
      
      title               | artist
      ---------------------+-----------------------------
             Mild and Hazy |                      Hayden
      
      
      (1 rows)
      
      SELECT * FROM music.albums WHERE artist='Object';
      
      title               | artist
      ---------------------+-----------------------------
      
      (0 rows)
      
      SELECT * FROM music.albums WHERE artist like 'Ob%';
      
      title               | artist
      ---------------------+-----------------------------
      
      (0 rows)
      

      Strangely enough, after clearing all the data and re-inserting it without the Russian artist with the Cyrillic name, SASI does find 'Object':

      DROP INDEX albums_artist_idx;
      TRUNCATE TABLE albums;
      
      INSERT INTO albums(artist,title) VALUES('Object','The Reflecting Skin');
      INSERT INTO albums(artist,title) VALUES('Hayden','Mild and Hazy');
      
      
      CREATE custom INDEX on music.albums(artist) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = { 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false'};
      
      SELECT * FROM music.albums;
      
      
      title               | artist
      ---------------------+-----------------------------
       The Reflecting Skin |                      Object
             Mild and Hazy |                      Hayden
      
      (2 rows)
      
      SELECT * FROM music.albums WHERE artist='Object';
      
      title               | artist
      ---------------------+-----------------------------
       The Reflecting Skin |                      Object
      
      (1 rows)
      
      SELECT * FROM music.albums WHERE artist LIKE 'Ob%';
      
      title               | artist
      ---------------------+-----------------------------
       The Reflecting Skin |                      Object
      
      (1 rows)
      
      

      The behaviour is quite inconsistent. I could understand SASI refusing to index Cyrillic characters, or throwing an exception when it encounters non-ASCII characters (because we did not specify a locale), but it is very surprising that lookups fail for a plain ASCII term like 'Object'.

      Could it be that SASI indexes the artist names in the token range order of the albums table (the hash of title) and stops indexing once it encounters the Cyrillic name?
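
      One way to probe this hypothesis (a sketch, not something that was run as part of this report) is to print each row's partition token with the standard CQL token() function. A full table scan returns rows in token order, so the position of the Cyrillic row relative to 'Object' shows which of the two would be reached first if indexing really followed that order:

          SELECT token(title), title, artist FROM music.albums;

      In the unfiltered SELECT output above, the Cyrillic row is listed last, after 'The Reflecting Skin'. If that listing reflects token order, 'Object' would be reached before the Cyrillic name, which would not fit the hypothesis as stated.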

      Attachments

        1. 11122-range-term-tree-interval-tree.patch
          6 kB
          Sam Tunnicliffe
        2. CASSANDRA-11122.patch
          3 kB
          Pavel Yaskevich

          People

            Assignee: Unassigned
            Reporter: DuyHai Doan (doanduyhai)
            Votes: 1
            Watchers: 7