Details

      Description

      I think everyone agrees that a LIKE operator would be ideal, but that's probably not in scope for an initial 3.4 release.

      Still, I'm uncomfortable with the initial approach of overloading = to mean "satisfies index expression." The problem is that it will be very difficult to back out of this behavior once people are using it.

      I propose adding a new operator in the interim instead. Call it MATCHES, maybe. With the exact same behavior that SASI currently exposes, just with a separate operator rather than being rolled into =.

        Activity

        Hide
        jbellis Jonathan Ellis added a comment -
        Show
        jbellis Jonathan Ellis added a comment - /cc Pavel Yaskevich Jordan West
        Hide
        jkrupan Jack Krupansky added a comment -

        How about a severely restricted LIKE that only permits patterns ending with a "%" for prefix query (LIKE 'J%') or with a "%" at either end for contains (LIKE '%abc%') or ending with a % for suffix query (LIKE '%smith')? Then it would be fully compatible with SQL.

        In any case, "=" would then attempt an exact match using the SASI index? That would allow both exact and inexact matching for each column using a single index.

        If we can't have this restricted LIKE, descriptive keyword operators like SUFFIX and PREFIX would seem desirable. Could the existing CONTAINS operator also be used? They would also handle the case where the prefix/suffix/contains string is a parameter - otherwise the user has to do a messy concat.

        Show
        jkrupan Jack Krupansky added a comment - How about a severely restricted LIKE that only permits patterns ending with a "%" for prefix query (LIKE 'J%') or with a "%" at either end for contains (LIKE '%abc%') or ending with a % for suffix query (LIKE '%smith')? Then it would be fully compatible with SQL. In any case, "=" would then attempt an exact match using the SASI index? That would allow both exact and inexact matching for each column using a single index. If we can't have this restricted LIKE, descriptive keyword operators like SUFFIX and PREFIX would seem desirable. Could the existing CONTAINS operator also be used? They would also handle the case where the prefix/suffix/contains string is a parameter - otherwise the user has to do a messy concat.
        Hide
        jbellis Jonathan Ellis added a comment -

        If we can get a LIKE approach in for 3.4, so much the better. I'm saying that I don't think overloading = is acceptable to ship, and proposing the simplest possible alternative.

        Show
        jbellis Jonathan Ellis added a comment - If we can get a LIKE approach in for 3.4, so much the better. I'm saying that I don't think overloading = is acceptable to ship, and proposing the simplest possible alternative.
        Hide
        xedin Pavel Yaskevich added a comment -

        Sam Tunnicliffe Jonathan Ellis I'm going to take on this since I was about it do it anyway, LIKE itself is not a big change for SASI so I would rather go with that than MATCHES, should be able to make it work pretty fast.

        Show
        xedin Pavel Yaskevich added a comment - Sam Tunnicliffe Jonathan Ellis I'm going to take on this since I was about it do it anyway, LIKE itself is not a big change for SASI so I would rather go with that than MATCHES, should be able to make it work pretty fast.
        Hide
        jbellis Jonathan Ellis added a comment -

        Sounds good. Thanks!

        Show
        jbellis Jonathan Ellis added a comment - Sounds good. Thanks!
        Hide
        xedin Pavel Yaskevich added a comment -

        I've pushed all of the required changes to CASSANDRA-11067, currently supported:

        LIKE '%<term>' -> suffix
        LIKE '%<term>%' -> contains
        LIKE '<term>%' -> prefix
        LIKE '<term>' -> equals
        "=" is now exact equals as expected.

        Show
        xedin Pavel Yaskevich added a comment - I've pushed all of the required changes to CASSANDRA-11067 , currently supported: LIKE '%<term>' -> suffix LIKE '%<term>%' -> contains LIKE '<term>%' -> prefix LIKE '<term>' -> equals "=" is now exact equals as expected.
        Hide
        xedin Pavel Yaskevich added a comment -
        branch testall dtest
        CASSANDRA-11067 testall dtest
        Show
        xedin Pavel Yaskevich added a comment - branch testall dtest CASSANDRA-11067 testall dtest
        Hide
        jkrupan Jack Krupansky added a comment -

        Awesome! Watch out, SQL!

        One more nit...

        The fact that a SASI index needs to be "CUSTOM" and an explicit class name is needed feels a little hokey to me. Is there a longer-term plan to fully integrate SASI so it is a first-class feature rather than simply an add-on? In fact, is there any reason not to make it the default secondary indexing (other than the fact that is new and experimental and unproven in the real world yet)? Having the mode be a keyword rather than all this extra lexical distraction would feel better to me.

        But if this is billed as experimental in 3.4, maybe there is no real harm in deferring first-class status until a future feature release.

        Still, it would be nice to be able to say CREATE PREFIX INDEX or CREATE SUFFIX INDEX or CREATE SPARSE INDEX.

        Show
        jkrupan Jack Krupansky added a comment - Awesome! Watch out, SQL! One more nit... The fact that a SASI index needs to be "CUSTOM" and an explicit class name is needed feels a little hokey to me. Is there a longer-term plan to fully integrate SASI so it is a first-class feature rather than simply an add-on? In fact, is there any reason not to make it the default secondary indexing (other than the fact that is new and experimental and unproven in the real world yet)? Having the mode be a keyword rather than all this extra lexical distraction would feel better to me. But if this is billed as experimental in 3.4, maybe there is no real harm in deferring first-class status until a future feature release. Still, it would be nice to be able to say CREATE PREFIX INDEX or CREATE SUFFIX INDEX or CREATE SPARSE INDEX.
        Hide
        jbellis Jonathan Ellis added a comment -

        Agreed Jack, but this is something that can be added later with no compatibility issues.

        Show
        jbellis Jonathan Ellis added a comment - Agreed Jack, but this is something that can be added later with no compatibility issues.
        Hide
        jrwest Jordan West added a comment -

        other than the fact that is new and experimental and unproven in the real world yet?

        I think SASI is as, or more, proven than any change in the 3.x releases. Its been in production for longer than any 3.x feature and most of the changes for 3.x were surface-level integration changes as Pavel Yaskevich mentioned.

        The fact that a SASI index needs to be "CUSTOM" and an explicit class name is needed feels a little hokey to me.

        Agreed but we decided not to change this to ease the merge and because the sstable's format is not extendable easily, currently. Also, this the case for any non-default index class. I think it would be great to SASI become the default implementation or to have an easier way to specify which implementation to use.

        Show
        jrwest Jordan West added a comment - other than the fact that is new and experimental and unproven in the real world yet? I think SASI is as, or more, proven than any change in the 3.x releases. Its been in production for longer than any 3.x feature and most of the changes for 3.x were surface-level integration changes as Pavel Yaskevich mentioned. The fact that a SASI index needs to be "CUSTOM" and an explicit class name is needed feels a little hokey to me. Agreed but we decided not to change this to ease the merge and because the sstable's format is not extendable easily, currently. Also, this the case for any non-default index class. I think it would be great to SASI become the default implementation or to have an easier way to specify which implementation to use.
        Hide
        beobal Sam Tunnicliffe added a comment -

        I think it would be great to SASI become the default implementation or to have an easier way to specify which implementation to use

        +1 I don't see a reason why the SASI integration wouldn't ultimately remove the "CUSTOM" demarcation. That will require some minor-ish syntax changes and tweaks to validation of index metadata etc but nothing seems overly problematic there.

        Show
        beobal Sam Tunnicliffe added a comment - I think it would be great to SASI become the default implementation or to have an easier way to specify which implementation to use +1 I don't see a reason why the SASI integration wouldn't ultimately remove the "CUSTOM" demarcation. That will require some minor-ish syntax changes and tweaks to validation of index metadata etc but nothing seems overly problematic there.
        Hide
        slebresne Sylvain Lebresne added a comment -

        Agreed about consider SASI as the default (and short of that to make it more friendly syntax-wise) but that's really a separate issue that needs its own considering so created CASSANDRA-11075. Let's please keep the discussion focused on what the ticket is about.

        Show
        slebresne Sylvain Lebresne added a comment - Agreed about consider SASI as the default (and short of that to make it more friendly syntax-wise) but that's really a separate issue that needs its own considering so created CASSANDRA-11075 . Let's please keep the discussion focused on what the ticket is about.
        Hide
        jkrupan Jack Krupansky added a comment -

        Thanks, Sylvain Lebresne, for opening that separate issue. My apologies for taking advantage of the vague and general wording of the title/summary of this particular Jira. I had considered making my suggestions on the original ticket, but didn't when I saw that it was already "closed" and this one is suggestively labeled "Improve SASI syntax" (rather than "Restore = semantics for SASI".) Again, sorry for the distraction from getting SASI done for 3.4 ASAP.

        Show
        jkrupan Jack Krupansky added a comment - Thanks, Sylvain Lebresne , for opening that separate issue. My apologies for taking advantage of the vague and general wording of the title/summary of this particular Jira. I had considered making my suggestions on the original ticket, but didn't when I saw that it was already "closed" and this one is suggestively labeled "Improve SASI syntax" (rather than "Restore = semantics for SASI".) Again, sorry for the distraction from getting SASI done for 3.4 ASAP.
        Hide
        xedin Pavel Yaskevich added a comment - - edited

        Sam Tunnicliffe I've updated doc/SASI.md with LIKE support and fixed a typo in Cql.g, also I've noticed that loading of the SASI indexes on startup is missing so I've pushed that too into CASSANDRA-11067 branch (and fixed use of IntervalTree on the way) and rebased everything with the latest trunk.

        Edit: I've also (just) added ByteBufferUtil.{startsWith, endsWith} to complement ByteBufferUtil.contains for LIKE_{PREFIX, SUFFIX} accordingly.

        Show
        xedin Pavel Yaskevich added a comment - - edited Sam Tunnicliffe I've updated doc/SASI.md with LIKE support and fixed a typo in Cql.g, also I've noticed that loading of the SASI indexes on startup is missing so I've pushed that too into CASSANDRA-11067 branch (and fixed use of IntervalTree on the way) and rebased everything with the latest trunk. Edit: I've also (just) added ByteBufferUtil.{startsWith, endsWith} to complement ByteBufferUtil.contains for LIKE_{PREFIX, SUFFIX} accordingly.
        Hide
        jbellis Jonathan Ellis added a comment - - edited

        The example of stemming still uses the equality operator (bio='distributing'). Just an oversight in the doc?

        Show
        jbellis Jonathan Ellis added a comment - - edited The example of stemming still uses the equality operator ( bio='distributing' ). Just an oversight in the doc?
        Hide
        xedin Pavel Yaskevich added a comment -

        Jonathan Ellis It was actually intentional since stemming is a splitting analyzer, it's going to split a sentence into individual words and steam them, so "=" is a valid use case there.

        Show
        xedin Pavel Yaskevich added a comment - Jonathan Ellis It was actually intentional since stemming is a splitting analyzer, it's going to split a sentence into individual words and steam them, so "=" is a valid use case there.
        Hide
        jbellis Jonathan Ellis added a comment -

        I still think LIKE is a better fit here, because we're asking for a tokenized match, not full equality. (I'm fine with using LIKE without wildcards, though.)

        Show
        jbellis Jonathan Ellis added a comment - I still think LIKE is a better fit here, because we're asking for a tokenized match, not full equality. (I'm fine with using LIKE without wildcards, though.)
        Hide
        xedin Pavel Yaskevich added a comment -

        Jonathan Ellis Ok sure, I changed bio to LIKE without wildcards (which is effectively the same thing as equals) everywhere and rebased with the latest trunk.

        Show
        xedin Pavel Yaskevich added a comment - Jonathan Ellis Ok sure, I changed bio to LIKE without wildcards (which is effectively the same thing as equals) everywhere and rebased with the latest trunk.
        Hide
        beobal Sam Tunnicliffe added a comment -

        Mostly lgtm, I think there's just a couple of things:

        We need some validation of the index options with respect to the type of the target column. For instance:

        CREATE TABLE ks.t1(k int, v int, PRIMARY KEY(k));
        CREATE CUSTOM INDEX v_contains_index on ks.t1(v) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode' : 'CONTAINS'};
        

        This happily creates an index which can be queried using EQ, but not with LT/LTE/GT/GTE (in contrast to the one created if OPTIONS is ommitted). Of course, LIKE can't work for this index either. I don't think this is a showstopper, but it is somewhat counterintuitive for users. I'd be ok with leaving this for now if you'd rather open a ticket with broader scope to cover any/all validation we need to add, otherwise let's just address this particular instance now.

        This is very minor, but the error message is a little awkward when you try to execute a prefix/suffix match on an index which doesn't support that:

        cqlsh> select * from ks.t1 where name LIKE 'Pav%';
        InvalidRequest: code=2200 [Invalid query] message="name LIKE '<term>%' 'Pav' restriction is only supported on properly indexed columns"
        

        I don't think having the actual term value there is particularly useful (and interpolating it into the Operator string is fiddly) , so perhaps a specialisation in SingleColumnRelation like the one for isIN would work:

        if (isLIKE())
            return String.format("%s %s", entityAsString, relationType);
        

        Suffix matching doesn't seem to have the same level of coverage in SASIIndexTest as prefix and contains. Do you think it's worth adding anything, just for clarity if nothing else?

        Finally, in SASI.md, there's a couple of minor nits:

        The semantics of CQL's = were modified instead of making further modifications of the grammar with the introduction of a LIKE operator. Ideally, CQL would be modified to include such an operator, supporting both prefix and suffix searches.

        can be removed

        References to Indexer in "The SASIIndex Class" section should be to Searcher, right?

        Limitations and Caveats refers to the "lack of LIKE"

        Show
        beobal Sam Tunnicliffe added a comment - Mostly lgtm, I think there's just a couple of things: We need some validation of the index options with respect to the type of the target column. For instance: CREATE TABLE ks.t1(k int , v int , PRIMARY KEY(k)); CREATE CUSTOM INDEX v_contains_index on ks.t1(v) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode' : 'CONTAINS'}; This happily creates an index which can be queried using EQ, but not with LT/LTE/GT/GTE (in contrast to the one created if OPTIONS is ommitted). Of course, LIKE can't work for this index either. I don't think this is a showstopper, but it is somewhat counterintuitive for users. I'd be ok with leaving this for now if you'd rather open a ticket with broader scope to cover any/all validation we need to add, otherwise let's just address this particular instance now. This is very minor, but the error message is a little awkward when you try to execute a prefix/suffix match on an index which doesn't support that: cqlsh> select * from ks.t1 where name LIKE 'Pav%'; InvalidRequest: code=2200 [Invalid query] message= "name LIKE '<term>%' 'Pav' restriction is only supported on properly indexed columns" I don't think having the actual term value there is particularly useful (and interpolating it into the Operator string is fiddly) , so perhaps a specialisation in SingleColumnRelation like the one for isIN would work: if (isLIKE()) return String .format( "%s %s" , entityAsString, relationType); Suffix matching doesn't seem to have the same level of coverage in SASIIndexTest as prefix and contains. Do you think it's worth adding anything, just for clarity if nothing else? Finally, in SASI.md, there's a couple of minor nits: The semantics of CQL's = were modified instead of making further modifications of the grammar with the introduction of a LIKE operator. Ideally, CQL would be modified to include such an operator, supporting both prefix and suffix searches. can be removed References to Indexer in "The SASIIndex Class" section should be to Searcher , right? Limitations and Caveats refers to the "lack of LIKE"
        Hide
        beobal Sam Tunnicliffe added a comment -

        One more thing: the grammar only allows LIKE (upper case), can you make it case insensitive please?

        Show
        beobal Sam Tunnicliffe added a comment - One more thing: the grammar only allows LIKE (upper case), can you make it case insensitive please?
        Hide
        xedin Pavel Yaskevich added a comment -

        I've pushed to CASSANDRA-11067 squashed/rebased/updated version with all of the changes from the review - doc/error message are corrected and I've added a lot more LIKE_SUFFIX tests to SASIIndexTest. Going to try to make down-case LIKE work and let you know, the problem is it classes with rules from create statement.

        Show
        xedin Pavel Yaskevich added a comment - I've pushed to CASSANDRA-11067 squashed/rebased/updated version with all of the changes from the review - doc/error message are corrected and I've added a lot more LIKE_SUFFIX tests to SASIIndexTest. Going to try to make down-case LIKE work and let you know, the problem is it classes with rules from create statement.
        Hide
        jbellis Jonathan Ellis added a comment -

        I'm still uneasy about making = a synonym for LIKE. Every database I can think of treats = as full equality. I'm pretty sure that will surprise people.

        I think we should reject = queries against indexes that aren't going to actually give equality semantics, and require using LIKE instead.

        (I'm slightly uneasy about overloading LIKE as well, which is why originally I proposed something new like MATCHES. But LIKE already tends to vary from product to product so on balance I'm okay with abusing it a little more.)

        Show
        jbellis Jonathan Ellis added a comment - I'm still uneasy about making = a synonym for LIKE. Every database I can think of treats = as full equality. I'm pretty sure that will surprise people. I think we should reject = queries against indexes that aren't going to actually give equality semantics, and require using LIKE instead. (I'm slightly uneasy about overloading LIKE as well, which is why originally I proposed something new like MATCHES. But LIKE already tends to vary from product to product so on balance I'm okay with abusing it a little more.)
        Hide
        xedin Pavel Yaskevich added a comment -

        Regarding validation for index options you've mentioned I'm not really sure by what means we are going to accomplish that, maybe by requiring users to specify what queries do they want to run, but that would require wider audience/discussion so let's maybe leave it for a separate ticket...

        Show
        xedin Pavel Yaskevich added a comment - Regarding validation for index options you've mentioned I'm not really sure by what means we are going to accomplish that, maybe by requiring users to specify what queries do they want to run, but that would require wider audience/discussion so let's maybe leave it for a separate ticket...
        Hide
        rustyrazorblade Jon Haddad added a comment -

        Just for reference:

        MySQL: match(field) against ('search')
        postgres: something about @@ tsvector, I can't even tell it's so absurd
        oracle: CONTAINS(field, 'search', 1)
        sql server: CONTAINS(candidate_resume,”SQL Server”)

        My 2 cents: I'm partial to CONTAINS or MATCH rather than LIKE, since LIKE seems to usually be exact string search with a full table scan.

        Show
        rustyrazorblade Jon Haddad added a comment - Just for reference: MySQL: match(field) against ('search') postgres: something about @@ tsvector, I can't even tell it's so absurd oracle: CONTAINS(field, 'search', 1) sql server: CONTAINS(candidate_resume,”SQL Server”) My 2 cents: I'm partial to CONTAINS or MATCH rather than LIKE, since LIKE seems to usually be exact string search with a full table scan.
        Hide
        xedin Pavel Yaskevich added a comment -

        I'm still uneasy about making = a synonym for LIKE. Every database I can think of treats = as full equality. I'm pretty sure that will surprise people.

        But the LIKE is not a synonym for "=", only if LIKE is used without '%' it means essentially an equals, and "=" is always an exact match right now. Also I would like to point out that all of the index types in SASI support strict equality (current "=") because even CONTAINS mode does preserve the original word form, so I'm not even sure what is the problem because each index has supportedOperations check and if one of the implementations doesn't support strict equality it's going to be rejected because even LIKE '<term>' is transformed to EQ, which forces users to do LIKE with '%' which can only be PREFIX, CONTAINS or SUFFIX.

        Show
        xedin Pavel Yaskevich added a comment - I'm still uneasy about making = a synonym for LIKE. Every database I can think of treats = as full equality. I'm pretty sure that will surprise people. But the LIKE is not a synonym for "=", only if LIKE is used without '%' it means essentially an equals, and "=" is always an exact match right now. Also I would like to point out that all of the index types in SASI support strict equality (current "=") because even CONTAINS mode does preserve the original word form, so I'm not even sure what is the problem because each index has supportedOperations check and if one of the implementations doesn't support strict equality it's going to be rejected because even LIKE '<term>' is transformed to EQ, which forces users to do LIKE with '%' which can only be PREFIX, CONTAINS or SUFFIX.
        Hide
        xedin Pavel Yaskevich added a comment -

        Sam Tunnicliffe I've fixed LIKE to be case-insensitive and pushed amended commit to the issue branch, going to trigger CI build.

        Show
        xedin Pavel Yaskevich added a comment - Sam Tunnicliffe I've fixed LIKE to be case-insensitive and pushed amended commit to the issue branch, going to trigger CI build.
        Hide
        jbellis Jonathan Ellis added a comment -

        only if LIKE is used without '%' it means essentially an equals, and "=" is always an exact match right now.

        But you're talking about exact match for search terms, right? I'm talking about exact match for the entire column.

        So,

        SELECT * FROM sasi WHERE bio = 'distributing'
        

        should return zero rows, but

        SELECT * FROM sasi WHERE bio = 'Software Engineer, works on the freight distribution at nights and likes arguing'
        

        should return one row.

        If that's how it already works, I apologize for the noise!

        Show
        jbellis Jonathan Ellis added a comment - only if LIKE is used without '%' it means essentially an equals, and "=" is always an exact match right now. But you're talking about exact match for search terms, right? I'm talking about exact match for the entire column. So, SELECT * FROM sasi WHERE bio = 'distributing' should return zero rows, but SELECT * FROM sasi WHERE bio = 'Software Engineer, works on the freight distribution at nights and likes arguing' should return one row. If that's how it already works, I apologize for the noise!
        Hide
        xedin Pavel Yaskevich added a comment - - edited

        If you don't add an analyzer to the column which does stemming and tokenization it would work exactly how you describe - "distributing" would return 0 results and whole string would be 1, it's tokenization feature which makes it work the way it does in the example because after tokenization every term in of that string is a separate entity, and even more in case of "distributing" - only it's stem is going to be saved which is "distribut" that's why matching "distributing" vs. "distribution" which is an original value is going to produce results, but to make it work multiple additional SASI options are needed, by default it's not going to do any of that and going to behave like you describe.

        Show
        xedin Pavel Yaskevich added a comment - - edited If you don't add an analyzer to the column which does stemming and tokenization it would work exactly how you describe - "distributing" would return 0 results and whole string would be 1, it's tokenization feature which makes it work the way it does in the example because after tokenization every term in of that string is a separate entity, and even more in case of "distributing" - only it's stem is going to be saved which is "distribut" that's why matching "distributing" vs. "distribution" which is an original value is going to produce results, but to make it work multiple additional SASI options are needed, by default it's not going to do any of that and going to behave like you describe.
        Hide
        jbellis Jonathan Ellis added a comment -

        So I'm saying we need to keep = always pinned to untokenized semantics. My understanding is that we could either doing this by doing a tokenized index lookup, and then an extra filter stage to throw away non-exact-matches. But alternatively we could just disallow using = on a tokenized index query and require LIKE instead.

        Show
        jbellis Jonathan Ellis added a comment - So I'm saying we need to keep = always pinned to untokenized semantics. My understanding is that we could either doing this by doing a tokenized index lookup, and then an extra filter stage to throw away non-exact-matches. But alternatively we could just disallow using = on a tokenized index query and require LIKE instead.
        Hide
        xedin Pavel Yaskevich added a comment -

        The problem which that approach is some of the tokenization doesn't cause changes to internal structure of the text so requiring always to use "=" is not aways adequate. I would say we need to let indexing implementations decide what ops to support and how to support them, because "=" for tokenized/stemmed results is a norm.

        Show
        xedin Pavel Yaskevich added a comment - The problem which that approach is some of the tokenization doesn't cause changes to internal structure of the text so requiring always to use "=" is not aways adequate. I would say we need to let indexing implementations decide what ops to support and how to support them, because "=" for tokenized/stemmed results is a norm.
        Hide
        xedin Pavel Yaskevich added a comment -

        Sam Tunnicliffe FYI, all CI tests are complete, dtest has 0 errors and testall has unrelated failure which happens in the trunk as well.

        Show
        xedin Pavel Yaskevich added a comment - Sam Tunnicliffe FYI, all CI tests are complete, dtest has 0 errors and testall has unrelated failure which happens in the trunk as well.
        Hide
        jbellis Jonathan Ellis added a comment -

        The problem which that approach is some of the tokenization doesn't cause changes to internal structure of the text so requiring always to use "=" is not aways adequate.

        Not sure what you mean. You can always use LIKE 'term' where you could use = 'term', right? So just use LIKE and leave = for untokenized (and numeric) comparison.

        I would say we need to let indexing implementations decide what ops to support and how to support them, because "=" for tokenized/stemmed results is a norm.

        Where is = the norm? Jon gave a good summary above for LIKE behavior, I can add that none of those systems overload = for tokenization.

        Show
        jbellis Jonathan Ellis added a comment - The problem which that approach is some of the tokenization doesn't cause changes to internal structure of the text so requiring always to use "=" is not aways adequate. Not sure what you mean. You can always use LIKE 'term' where you could use = 'term' , right? So just use LIKE and leave = for untokenized (and numeric) comparison. I would say we need to let indexing implementations decide what ops to support and how to support them, because "=" for tokenized/stemmed results is a norm. Where is = the norm? Jon gave a good summary above for LIKE behavior, I can add that none of those systems overload = for tokenization.
        Hide
        xedin Pavel Yaskevich added a comment -

        Not sure what you mean. You can always use LIKE 'term' where you could use = 'term', right? So just use LIKE and leave = for untokenized (and numeric) comparison.

        The point I'm trying to make - there is no way to tell what analyzer would do with the data, which means that every time analyzer is set on the field I need to force all of the queries to use LIKE instead of "=" that doesn't really make a lot of sense if analyzer/tokenizer doesn't do anything special with the data e.g. it's simply down-cases it, I can make such change but the problem with it is - index parameterization are going to directly affect requests, because after adding analyzer, all of the previous "=" are just going to fail although they might be still totally valid ones. I would also like to emphasize the point that by just specifying 'mode' everything works in the "canonical" way - "=" is strict equality and LIKE is like but as soon as users are going to start playing with tokenization and other analyzer properties they would directly become aware what would LIKE and "=" mean in the new context.

        Where is = the norm? Jon gave a good summary above for LIKE behavior, I can add that none of those systems overload = for tokenization.

        I'm more talking about free text search (something like Lucene/Solr) because that's the only place where analyzers like stemming and tokenization are useful which don't have a lot specific syntactic constructs (e.g. ":") but use something which is effectively an "=" in most of the analyzer/tokenization involved situations.

        Show
        xedin Pavel Yaskevich added a comment - Not sure what you mean. You can always use LIKE 'term' where you could use = 'term', right? So just use LIKE and leave = for untokenized (and numeric) comparison. The point I'm trying to make - there is no way to tell what analyzer would do with the data, which means that every time analyzer is set on the field I need to force all of the queries to use LIKE instead of "=" that doesn't really make a lot of sense if analyzer/tokenizer doesn't do anything special with the data e.g. it's simply down-cases it, I can make such change but the problem with it is - index parameterization are going to directly affect requests, because after adding analyzer, all of the previous "=" are just going to fail although they might be still totally valid ones. I would also like to emphasize the point that by just specifying 'mode' everything works in the "canonical" way - "=" is strict equality and LIKE is like but as soon as users are going to start playing with tokenization and other analyzer properties they would directly become aware what would LIKE and "=" mean in the new context. Where is = the norm? Jon gave a good summary above for LIKE behavior, I can add that none of those systems overload = for tokenization. I'm more talking about free text search (something like Lucene/Solr) because that's the only place where analyzers like stemming and tokenization are useful which don't have a lot specific syntactic constructs (e.g. ":") but use something which is effectively an "=" in most of the analyzer/tokenization involved situations.
        Hide
        jbellis Jonathan Ellis added a comment -

        the problem with it is - index parameterization are going to directly affect requests, because after adding analyzer, all of the previous "=" are just going to fail although they might be still totally valid ones.

        I'm not sure I follow. Can you give an example?

        Show
        jbellis Jonathan Ellis added a comment - the problem with it is - index parameterization are going to directly affect requests, because after adding analyzer, all of the previous "=" are just going to fail although they might be still totally valid ones. I'm not sure I follow. Can you give an example?
        Hide
        xedin Pavel Yaskevich added a comment -

        Sure, here is one of the scenarios:

        • I create an index on the column - first_name and I just set up 'mode' = 'PREFIX'
        • all the queries I do need to be on the strict equality e.g. `SELECT * FROM users WHERE first_name = 'Pavel' LIMIT 10;`
        • But then I realize that what I really want is results to be case insensitive still do a strict equality on the word itself, so I add
          option to the index - set analyzer class to NonTokenizingAnalyzer and set "case_sensitive" to "false"
        • `SELECT * FROM users WHERE first_name = 'Pavel' LIMIT 10;` would still produce results (probably more since it can match PaVel, pavel etc.) because analyzer would process input text and down-case and match it internally with already down-cased index although the original data is not touched at all.

        But if we do query enforce LIKE it would mean that my `first_name =` is going to start failing with something like following "InvalidRequest("Use LIKE instead of EQ for column 'first_name'") instead, which doesn't really make much sense to me as a user because I still want to do my strict equality queries.

        Show
        xedin Pavel Yaskevich added a comment - Sure, here is one of the scenarios: I create an index on the column - first_name and I just set up 'mode' = 'PREFIX' all the queries I do need to be on the strict equality e.g. `SELECT * FROM users WHERE first_name = 'Pavel' LIMIT 10;` But then I realize that what I really want is results to be case insensitive still do a strict equality on the word itself, so I add option to the index - set analyzer class to NonTokenizingAnalyzer and set "case_sensitive" to "false" `SELECT * FROM users WHERE first_name = 'Pavel' LIMIT 10;` would still produce results (probably more since it can match PaVel, pavel etc.) because analyzer would process input text and down-case and match it internally with already down-cased index although the original data is not touched at all. But if we do query enforce LIKE it would mean that my `first_name =` is going to start failing with something like following "InvalidRequest("Use LIKE instead of EQ for column 'first_name'") instead, which doesn't really make much sense to me as a user because I still want to do my strict equality queries.
        Hide
        jbellis Jonathan Ellis added a comment -

        I see. I think I'm okay with that. Again, if you want to avoid that, you can use LIKE instead, which will change behavior to match the index by design. Whereas equals does not, also by design.

        Re the prior art, I think that where the behavior of databases and search systems conflict, we should prefer to act like a database. I also note that Solr uses "q=term" syntax but this is saying the query is this term, not the document which is what we're talking about here. And ES uses "match" in all the examples I could find.

        Show
        jbellis Jonathan Ellis added a comment - I see. I think I'm okay with that. Again, if you want to avoid that, you can use LIKE instead, which will change behavior to match the index by design. Whereas equals does not, also by design. Re the prior art, I think that where the behavior of databases and search systems conflict, we should prefer to act like a database. I also note that Solr uses "q=term" syntax but this is saying the query is this term, not the document which is what we're talking about here. And ES uses "match" in all the examples I could find.
        Hide
        xedin Pavel Yaskevich added a comment -

        I disagree on both points, so let me clarify:

        Again, if you want to avoid that, you can use LIKE instead, which will change behavior to match the index by design. Whereas equals does not, also by design.

        This is not true - since PREFIX index can and does match both - EQ and PREFIX queries so '=' query makes sense to use there since it's - give me only results completely matching string N without accounting for case sensitivity, where LIKE query makes no sense in that context because LIKE effectively means "matching" and not "exact".

        I think that where the behavior of databases and search systems conflict we should prefer to act like a database.

        I disagree, because basically what you are saying here - we don't want analyzers, but they are essential for operation of indexes like SASI which can do both - indexing and search and each of the columns is considered a collection of indexable terms that can be intersected together with other columns via the set of unifying row ids.

        Which brings us to the bigger point - if we want to control how/what indexes do we should not provide a pluggable API so people can build their on but if we do we should respect decisions that authors took since that is an opt-in feature.

        Show
        xedin Pavel Yaskevich added a comment - I disagree on both points, so let me clarify: Again, if you want to avoid that, you can use LIKE instead, which will change behavior to match the index by design. Whereas equals does not, also by design. This is not true - since PREFIX index can and does match both - EQ and PREFIX queries so '=' query makes sense to use there since it's - give me only results completely matching string N without accounting for case sensitivity, where LIKE query makes no sense in that context because LIKE effectively means "matching" and not "exact". I think that where the behavior of databases and search systems conflict we should prefer to act like a database. I disagree, because basically what you are saying here - we don't want analyzers, but they are essential for operation of indexes like SASI which can do both - indexing and search and each of the columns is considered a collection of indexable terms that can be intersected together with other columns via the set of unifying row ids. Which brings us to the bigger point - if we want to control how/what indexes do we should not provide a pluggable API so people can build their on but if we do we should respect decisions that authors took since that is an opt-in feature.
        Hide
        jbellis Jonathan Ellis added a comment -

        PREFIX index can and does match both - EQ and PREFIX queries

        Isn't that the difference between LIKE 'term' and LIKE 'term%' ?

        Show
        jbellis Jonathan Ellis added a comment - PREFIX index can and does match both - EQ and PREFIX queries Isn't that the difference between LIKE 'term' and LIKE 'term%' ?
        Hide
        xedin Pavel Yaskevich added a comment -

        Isn't that the difference between LIKE 'term' and LIKE 'term%' ?

        It is a currently a difference, but it's not exactly the point but rather that `LIKE 'term'` is essentially but not necessarily `= 'term'` and definitely not vice versa, for example LIKE '<term>' and = '<term>' would be completely different meaning for indexes which support fuzzy matching, hence I'm against the idea of enforcing LIKE 'term' vs. = 'term' which locks-in index implementers and backend into the certain not flexible framework, the same way as current assumption that everything is separated by AND does.

        I can add additional MATCHES type to Operator class instead of converting LIKE '<term>' to Operator.EQ but SASI indexes would always implement both EQ and MATCH as well as PREFIX, CONTAINS, SUFFIX depending on the index type, so I'm not sure how useful it's going to be in the current situation.

        Show
        xedin Pavel Yaskevich added a comment - Isn't that the difference between LIKE 'term' and LIKE 'term%' ? It is a currently a difference, but it's not exactly the point but rather that `LIKE 'term'` is essentially but not necessarily `= 'term'` and definitely not vice versa, for example LIKE '<term>' and = '<term>' would be completely different meaning for indexes which support fuzzy matching, hence I'm against the idea of enforcing LIKE 'term' vs. = 'term' which locks-in index implementers and backend into the certain not flexible framework, the same way as current assumption that everything is separated by AND does. I can add additional MATCHES type to Operator class instead of converting LIKE '<term>' to Operator.EQ but SASI indexes would always implement both EQ and MATCH as well as PREFIX, CONTAINS, SUFFIX depending on the index type, so I'm not sure how useful it's going to be in the current situation.
        Hide
        jbellis Jonathan Ellis added a comment -

        I feel like we could come to an agreement in person relatively quickly but I am traveling the rest of this week. Let's proceed with this ticket as implemented (it is definitely a big improvement!) and see if we can find time to talk next week.

        Show
        jbellis Jonathan Ellis added a comment - I feel like we could come to an agreement in person relatively quickly but I am traveling the rest of this week. Let's proceed with this ticket as implemented (it is definitely a big improvement!) and see if we can find time to talk next week.
        Hide
        xedin Pavel Yaskevich added a comment -

        Sounds good! Let me know when you'd have time to chat about this, we should definitely sort this out, since it feels like we are talking about two different but closely related things.

        Maybe what you mean is - only if index mode is CONTAINS you want to force people to always use LIKE in any situation? If so additional Operator.MATCHES is going to make it work.

        Show
        xedin Pavel Yaskevich added a comment - Sounds good! Let me know when you'd have time to chat about this, we should definitely sort this out, since it feels like we are talking about two different but closely related things. Maybe what you mean is - only if index mode is CONTAINS you want to force people to always use LIKE in any situation? If so additional Operator.MATCHES is going to make it work.
        Hide
        beobal Sam Tunnicliffe added a comment -

        Regarding validation for index options you've mentioned I'm not really sure by what means we are going to accomplish that

        Fair enough, comprehensive validation at index creation is difficult without knowing or limiting the set of expected queries. The pressing problem I see is that the choice of MemIndex implementation (goverened by whether the indexed terms are considered literals) constrains the supported operations. Because its underlying trie doesn't support range operations, TrieMemIndex rejects range queries by throwing an UnsupportedOperationException at execution time. Although this is slightly different to the problem described in CASSANDRA-11043, the outcome is the same:

        cqlsh> create table ks.t1(k int primary key, v int);
        cqlsh> create custom index prefix_idx on ks.t1(v) using 'org.apache.cassandra.index.sasi.SASIIndex' with options = { 'mode':'PREFIX', 'is_literal':'true'};
        cqlsh> select * from ks.t1 where v > 0;
        Traceback (most recent call last):
          File "bin/cqlsh.py", line 1250, in perform_simple_statement
            result = future.result()
          File "/home/sam/projects/cassandra/bin/../lib/cassandra-driver-internal-only-3.0.0-6af642d.zip/cassandra-driver-3.0.0-6af642d/cassandra/cluster.py", line 3122, in result
            raise self._final_exception
        ReadFailure: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 1 failures" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
        

        I played with simply extending ColumnIndex to return false if isLiteral() && op == RANGE which fixes this specific problem. My concern is that that may be others that I haven't considered, wdyt?

        A couple of other, minor things (sorry, I missed these earlier):

        Log messages generated during flush are generally written at DEBUG level now (and so don't appear in system.log with the default config). There are a few related messages from SASI still logged at INFO, can those be switched to DEBUG?

        One more nit in the docs: IndexMemtable section still refers to NORMAL and SUFFIX indexing modes.

        Show
        beobal Sam Tunnicliffe added a comment - Regarding validation for index options you've mentioned I'm not really sure by what means we are going to accomplish that Fair enough, comprehensive validation at index creation is difficult without knowing or limiting the set of expected queries. The pressing problem I see is that the choice of MemIndex implementation (goverened by whether the indexed terms are considered literals) constrains the supported operations. Because its underlying trie doesn't support range operations, TrieMemIndex rejects range queries by throwing an UnsupportedOperationException at execution time. Although this is slightly different to the problem described in CASSANDRA-11043 , the outcome is the same: cqlsh> create table ks.t1(k int primary key, v int ); cqlsh> create custom index prefix_idx on ks.t1(v) using 'org.apache.cassandra.index.sasi.SASIIndex' with options = { 'mode':'PREFIX', 'is_literal':' true '}; cqlsh> select * from ks.t1 where v > 0; Traceback (most recent call last): File "bin/cqlsh.py" , line 1250, in perform_simple_statement result = future .result() File "/home/sam/projects/cassandra/bin/../lib/cassandra-driver-internal-only-3.0.0-6af642d.zip/cassandra-driver-3.0.0-6af642d/cassandra/cluster.py" , line 3122, in result raise self._final_exception ReadFailure: code=1300 [Replica(s) failed to execute read] message= "Operation failed - received 0 responses and 1 failures" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'} I played with simply extending ColumnIndex to return false if isLiteral() && op == RANGE which fixes this specific problem. My concern is that that may be others that I haven't considered, wdyt? A couple of other, minor things (sorry, I missed these earlier): Log messages generated during flush are generally written at DEBUG level now (and so don't appear in system.log with the default config). There are a few related messages from SASI still logged at INFO, can those be switched to DEBUG? One more nit in the docs: IndexMemtable section still refers to NORMAL and SUFFIX indexing modes.
        Hide
        doanduyhai DOAN DuyHai added a comment -

        I agree with Jonathan POV wrt the semantics of =.

        Let me rephrase it with some example:

        As an user, when issuing WHERE name='John' with SASI, I expect:

        • If the index on name is case-insensitive, SASI should return rows having 'John' or 'john' or 'joHn' etc...
        • Else SASI should return rows having exactly 'John'

        In both cases (case-sensitive or not), I don't expect SASI to return me 'Johnathan' (with extra characters at the end) or 'Leejohn' (extra characters at the beginning) etc ...

        Pavel Yaskevich, what combination of options configurations can give me the above behaviour ? SPARSE mode ? NORMAL (formerly called PREFIX) mode without any analyzer ?

        Show
        doanduyhai DOAN DuyHai added a comment - I agree with Jonathan POV wrt the semantics of = . Let me rephrase it with some example: As an user, when issuing WHERE name='John' with SASI, I expect: If the index on name is case-insensitive, SASI should return rows having 'John' or 'john' or 'joHn' etc... Else SASI should return rows having exactly 'John' In both cases (case-sensitive or not), I don't expect SASI to return me 'Johnathan' (with extra characters at the end) or 'Leejohn' (extra characters at the beginning) etc ... Pavel Yaskevich , what combination of options configurations can give me the above behaviour ? SPARSE mode ? NORMAL (formerly called PREFIX) mode without any analyzer ?
        Hide
        jkrupan Jack Krupansky added a comment -

        For reference, over in Solr land users constantly struggle with how to combine exact and partial matching - sometimes they want an absolute literal match for the full field/column, sometimes a wildcard on that full field, sometimes keyword tokenization, sometimes wildcard on tokenized terms, sometimes phrases of tokenized terms, and sometimes phrases from the full literal string. Unfortunately, Solr doesn't have a direct answer for that, so people are forced to copy the field (typically a <copyyField>) directive and then one field is the literal string and the other is the tokenized field. That gives them complete control at query time, so q=name_literal:Joe would only match when the full name is Joe while q=name_tokenized:joe would match for any name with joe. Similarly, q=name_lit:Jo* would only match names with Jo as a prefix, while q=name_tok:jo* would match Joe Smith as well as Bill Johnson.

        The user might also opt to copy to yet a third field which is tokenized but with the so-called keyword tokenizer which permits the string to be normalized but not broken into tokens. The common case is to lower case, but other common cases would be to eliminate punctuation, replace certain prefixes and suffixes, or whatever.

        The real point there is that "exact" match is still a range of possibilities.

        One of the issues here for Cassandra is whether you really want to combine these two separate exactness semantics that Solr keeps separate.

        Show
        jkrupan Jack Krupansky added a comment - For reference, over in Solr land users constantly struggle with how to combine exact and partial matching - sometimes they want an absolute literal match for the full field/column, sometimes a wildcard on that full field, sometimes keyword tokenization, sometimes wildcard on tokenized terms, sometimes phrases of tokenized terms, and sometimes phrases from the full literal string. Unfortunately, Solr doesn't have a direct answer for that, so people are forced to copy the field (typically a <copyyField>) directive and then one field is the literal string and the other is the tokenized field. That gives them complete control at query time, so q=name_literal:Joe would only match when the full name is Joe while q=name_tokenized:joe would match for any name with joe. Similarly, q=name_lit:Jo* would only match names with Jo as a prefix, while q=name_tok:jo* would match Joe Smith as well as Bill Johnson. The user might also opt to copy to yet a third field which is tokenized but with the so-called keyword tokenizer which permits the string to be normalized but not broken into tokens. The common case is to lower case, but other common cases would be to eliminate punctuation, replace certain prefixes and suffixes, or whatever. The real point there is that "exact" match is still a range of possibilities. One of the issues here for Cassandra is whether you really want to combine these two separate exactness semantics that Solr keeps separate.
        Hide
        jkrupan Jack Krupansky added a comment -

        Clarification question: Will SASI apply the analyzer to the LIKE string?

        Then... what will happen if that analysis produces more than one term? In Solr land that is expected and the semantics is phrase query. What will SASI do? Will it be an error or be treated as a list of AND terms?

        Show
        jkrupan Jack Krupansky added a comment - Clarification question: Will SASI apply the analyzer to the LIKE string? Then... what will happen if that analysis produces more than one term? In Solr land that is expected and the semantics is phrase query. What will SASI do? Will it be an error or be treated as a list of AND terms?
        Hide
        jkrupan Jack Krupansky added a comment -

        Clarification question about SASI itself (as oppose to Cassandra syntax/semantics): If the column is tokenized, is the original raw literal text for each column also still available for indexing or are only the tokenized/analyzed terms indexed?

        Show
        jkrupan Jack Krupansky added a comment - Clarification question about SASI itself (as oppose to Cassandra syntax/semantics): If the column is tokenized, is the original raw literal text for each column also still available for indexing or are only the tokenized/analyzed terms indexed?
        Hide
        xedin Pavel Yaskevich added a comment -

        DOAN DuyHai I don't disagree on the semantics of "=" and what you have presented is exactly how it works right now, I only disagree that we need to force people to use LIKE '<term>' where "=" would be rather sufficient because LIKE 'term' and "=" although they can have the same meaning not necessarily do have it in different circumstances.

        Jack Krupansky I suggest you to at least read the doc doc/SASI.md or even better look at the code.

        Show
        xedin Pavel Yaskevich added a comment - DOAN DuyHai I don't disagree on the semantics of "=" and what you have presented is exactly how it works right now, I only disagree that we need to force people to use LIKE '<term>' where "=" would be rather sufficient because LIKE 'term' and "=" although they can have the same meaning not necessarily do have it in different circumstances. Jack Krupansky I suggest you to at least read the doc doc/SASI.md or even better look at the code.
        Hide
        xedin Pavel Yaskevich added a comment -

        Sam Tunnicliffe

        Fair enough, comprehensive validation at index creation is difficult without knowing or limiting the set of expected queries.

        I think what we probably will end up doing is instead of 'mode' we'll ask users to specify what kind of operations do the want to perform and based on that pick the mode internally, but let's maybe leave this to a separate issue since it's unclear yet what the interface is going to be.

        Log messages generated during flush are generally written at DEBUG level now (and so don't appear in system.log with the default config). There are a few related messages from SASI still logged at INFO, can those be switched to DEBUG?

        Flushing/Loading of the indexes is intentionally left at INFO to give operators more visibility into what is going on and how much time do things take, that's the only things which are currently logged.

        One more nit in the docs: IndexMemtable section still refers to NORMAL and SUFFIX indexing modes.

        Can you please point me exactly where it is because I don't see it anywhere and grep doesn't show anything in the CASSANDRA-11067 branch.

        Show
        xedin Pavel Yaskevich added a comment - Sam Tunnicliffe Fair enough, comprehensive validation at index creation is difficult without knowing or limiting the set of expected queries. I think what we probably will end up doing is instead of 'mode' we'll ask users to specify what kind of operations do the want to perform and based on that pick the mode internally, but let's maybe leave this to a separate issue since it's unclear yet what the interface is going to be. Log messages generated during flush are generally written at DEBUG level now (and so don't appear in system.log with the default config). There are a few related messages from SASI still logged at INFO, can those be switched to DEBUG? Flushing/Loading of the indexes is intentionally left at INFO to give operators more visibility into what is going on and how much time do things take, that's the only things which are currently logged. One more nit in the docs: IndexMemtable section still refers to NORMAL and SUFFIX indexing modes. Can you please point me exactly where it is because I don't see it anywhere and grep doesn't show anything in the CASSANDRA-11067 branch.
        Hide
        rustyrazorblade Jon Haddad added a comment -

        I like to test this stuff by feel. This feels perfectly reasonable to me (current behavior off the 11067 branch):

        create table users ( id int primary key, first_name text, last_name text);
        
        insert into users (id, first_name, last_name) VALUES (1, 'Jon', 'Haddad');
        insert into users (id, first_name, last_name) values (2, 'Jonathan', 'Ellis');
        
        create custom INDEX on users (first_name)
            USING 'org.apache.cassandra.index.sasi.SASIIndex'
            WITH OPTIONS = { 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive':'false'};
        
        
        cqlsh:test> select * from users where first_name = 'Jon';
        
             id | first_name | last_name
            ----+------------+-----------
              1 |        Jon |    Haddad
        
            (1 rows)
        
        cqlsh:test> select * from users where first_name LIKE 'Jon';
        
         id | first_name | last_name
        ----+------------+-----------
          1 |        Jon |    Haddad
        
        (1 rows)
        cqlsh:test> select * from users where first_name LIKE 'Jon%';
        
         id | first_name | last_name
        ----+------------+-----------
          1 |        Jon |    Haddad
          2 |   Jonathan |     Ellis
        

        Side note, trunk currently returns rows 1 & 2 for = queries, so the behavior is, IMO, significantly improved in this branch.

        Show
        rustyrazorblade Jon Haddad added a comment - I like to test this stuff by feel. This feels perfectly reasonable to me (current behavior off the 11067 branch): create table users ( id int primary key, first_name text, last_name text); insert into users (id, first_name, last_name) VALUES (1, 'Jon', 'Haddad'); insert into users (id, first_name, last_name) values (2, 'Jonathan', 'Ellis'); create custom INDEX on users (first_name) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = { 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive':' false '}; cqlsh:test> select * from users where first_name = 'Jon'; id | first_name | last_name ----+------------+----------- 1 | Jon | Haddad (1 rows) cqlsh:test> select * from users where first_name LIKE 'Jon'; id | first_name | last_name ----+------------+----------- 1 | Jon | Haddad (1 rows) cqlsh:test> select * from users where first_name LIKE 'Jon%'; id | first_name | last_name ----+------------+----------- 1 | Jon | Haddad 2 | Jonathan | Ellis Side note, trunk currently returns rows 1 & 2 for = queries, so the behavior is, IMO, significantly improved in this branch.
        Hide
        xedin Pavel Yaskevich added a comment -

        Jon Haddad Thanks for confirming! Change "=" behavior is what was an intention of this ticket - make "=" a string equality operation and add additional LIKE functionality which takes on supporting PREFIX, CONTAINS, SUFFIX queries and although LIKE '<term>' is essentially "=" for given index it might not be the case for other types of indexes, that's why I can't support forced use of LIKE '<term>' as replacement for "=" if analyzer is set, I think both of them should be available and implementors of index interface should decide.

        Show
        xedin Pavel Yaskevich added a comment - Jon Haddad Thanks for confirming! Change "=" behavior is what was an intention of this ticket - make "=" a string equality operation and add additional LIKE functionality which takes on supporting PREFIX, CONTAINS, SUFFIX queries and although LIKE '<term>' is essentially "=" for given index it might not be the case for other types of indexes, that's why I can't support forced use of LIKE '<term>' as replacement for "=" if analyzer is set, I think both of them should be available and implementors of index interface should decide.
        Hide
        aweisberg Ariel Weisberg added a comment -

        Pavel Yaskevich I think the docs here may still be slightly off WRT to LIKE vs '=' and the current state of the world.

        The semantics of CQL's = were modified instead of making further modifications of the grammar with the introduction of a LIKE operator. Ideally, CQL would be modified to include such an operator, supporting both prefix and suffix searches.

        Show
        aweisberg Ariel Weisberg added a comment - Pavel Yaskevich I think the docs here may still be slightly off WRT to LIKE vs '=' and the current state of the world. The semantics of CQL's = were modified instead of making further modifications of the grammar with the introduction of a LIKE operator. Ideally, CQL would be modified to include such an operator, supporting both prefix and suffix searches.
        Hide
        xedin Pavel Yaskevich added a comment -

        Ariel Weisberg Thanks for catching that up, modified and pushed! It's always good to have some fresh eyes on it

        Show
        xedin Pavel Yaskevich added a comment - Ariel Weisberg Thanks for catching that up, modified and pushed! It's always good to have some fresh eyes on it
        Hide
        beobal Sam Tunnicliffe added a comment -

        what we probably will end up doing is instead of 'mode' we'll ask users to specify what kind of operations do the want to perform and based on that pick the mode internally, but let's maybe leave this to a separate issue since it's unclear yet what the interface is going to be.

        A separate issue for figuring out the long term solution is great, but I really think that not giving users a crappy experience out of the box in cases where we can easily avoid it is a good idea. Is there a problem with not allowing range queries on literals that I've overlooked?

        Flushing/Loading of the indexes is intentionally left at INFO to give operators more visibility into what is going on and how much time do things take

        Ok, that's reasonable.

        Can you please point me exactly where it is because I don't see it anywhere

        Sorry, my bad I had the original docs open in an old browser tab. smdh

        Show
        beobal Sam Tunnicliffe added a comment - what we probably will end up doing is instead of 'mode' we'll ask users to specify what kind of operations do the want to perform and based on that pick the mode internally, but let's maybe leave this to a separate issue since it's unclear yet what the interface is going to be. A separate issue for figuring out the long term solution is great, but I really think that not giving users a crappy experience out of the box in cases where we can easily avoid it is a good idea. Is there a problem with not allowing range queries on literals that I've overlooked? Flushing/Loading of the indexes is intentionally left at INFO to give operators more visibility into what is going on and how much time do things take Ok, that's reasonable. Can you please point me exactly where it is because I don't see it anywhere Sorry, my bad I had the original docs open in an old browser tab. smdh
        Hide
        xedin Pavel Yaskevich added a comment - - edited

        A separate issue for figuring out the long term solution is great, but I really think that not giving users a crappy experience out of the box in cases where we can easily avoid it is a good idea. Is there a problem with not allowing range queries on literals that I've overlooked?

        The reason being that there is no efficient data structure which does support both range and prefix/suffix traversals on the literal queries as well as it is not very useful to do range queries on words anyway and it wasn't supported originally, only equals was available for literal indexes. Also I would like to note that range queries work file for numerical types since it makes more sense to ask such types for ranges.

        Edit: Sam Tunnicliffe does "is_literal" option have confusing name? It's original intention is to emphasize that given type consists of letters and should be considered a word (or sequence of words) and treated like a one, which means that if "int" column is marked literal is no longer a number but rather a sequence of unicode code points and range operation in it's regular sense it not supported but instead prefix/suffix queries are available.

        Show
        xedin Pavel Yaskevich added a comment - - edited A separate issue for figuring out the long term solution is great, but I really think that not giving users a crappy experience out of the box in cases where we can easily avoid it is a good idea. Is there a problem with not allowing range queries on literals that I've overlooked? The reason being that there is no efficient data structure which does support both range and prefix/suffix traversals on the literal queries as well as it is not very useful to do range queries on words anyway and it wasn't supported originally, only equals was available for literal indexes. Also I would like to note that range queries work file for numerical types since it makes more sense to ask such types for ranges. Edit: Sam Tunnicliffe does "is_literal" option have confusing name? It's original intention is to emphasize that given type consists of letters and should be considered a word (or sequence of words) and treated like a one, which means that if "int" column is marked literal is no longer a number but rather a sequence of unicode code points and range operation in it's regular sense it not supported but instead prefix/suffix queries are available.
        Hide
        beobal Sam Tunnicliffe added a comment -

        I don't think the option name is confusing & I completely understand why range queries are not supported when the column is being treated as a literal. My point is just that in such cases, we know upfront that the execution is going to fail if the operator is RANGE, so we should fail fast, rather than waiting to hit the UnsupportedOperationException in the ConcurrentTrie subclass.

        Show
        beobal Sam Tunnicliffe added a comment - I don't think the option name is confusing & I completely understand why range queries are not supported when the column is being treated as a literal. My point is just that in such cases, we know upfront that the execution is going to fail if the operator is RANGE, so we should fail fast, rather than waiting to hit the UnsupportedOperationException in the ConcurrentTrie subclass.
        Hide
        xedin Pavel Yaskevich added a comment - - edited

        Ok, so maybe we can make supportsExpression for SASI index more dynamic and return "false" in such case, would it go and try to do a range slice if we do that?

        Sam Tunnicliffe I've made and pushed that change, so if the mode has is_literal set and op is RANGE SASIIndex.supportsExpression is going to return false.

        Show
        xedin Pavel Yaskevich added a comment - - edited Ok, so maybe we can make supportsExpression for SASI index more dynamic and return "false" in such case, would it go and try to do a range slice if we do that? Sam Tunnicliffe I've made and pushed that change, so if the mode has is_literal set and op is RANGE SASIIndex.supportsExpression is going to return false.
        Hide
        beobal Sam Tunnicliffe added a comment -

        Excellent, thanks! That's exactly the change I was suggesting.

        Show
        beobal Sam Tunnicliffe added a comment - Excellent, thanks! That's exactly the change I was suggesting.
        Hide
        xedin Pavel Yaskevich added a comment -

        Great! Do you have anything else or are we good to go with this?

        Show
        xedin Pavel Yaskevich added a comment - Great! Do you have anything else or are we good to go with this?
        Hide
        beobal Sam Tunnicliffe added a comment -

        From my PoV I'm good with this, I assume you're going to open a new ticket for you and Jonathan Ellis to thrash out your LIKE vs EQ differences, if necessary.

        Show
        beobal Sam Tunnicliffe added a comment - From my PoV I'm good with this, I assume you're going to open a new ticket for you and Jonathan Ellis to thrash out your LIKE vs EQ differences, if necessary.
        Hide
        doanduyhai DOAN DuyHai added a comment - - edited

        Pavel Yaskevich You have found a beta tester for SASI even it's not released yet. I found a weird issue when inserting ascii and non-ascii character CASSANDRA-11122

        Show
        doanduyhai DOAN DuyHai added a comment - - edited Pavel Yaskevich You have found a beta tester for SASI even it's not released yet. I found a weird issue when inserting ascii and non-ascii character CASSANDRA-11122
        Hide
        xedin Pavel Yaskevich added a comment -

        Yeah, is the plan, we are going to discuss his concerns offline next week. I'll wait until my morning and if nobody else has any objections I'm going to push my rebased version to trunk and close this ticket.

        Show
        xedin Pavel Yaskevich added a comment - Yeah, is the plan, we are going to discuss his concerns offline next week. I'll wait until my morning and if nobody else has any objections I'm going to push my rebased version to trunk and close this ticket.
        Hide
        xedin Pavel Yaskevich added a comment -

        DOAN DuyHai sounds good! Let's address those problems separately from this ticket since it looks like they might be unrelated.

        Show
        xedin Pavel Yaskevich added a comment - DOAN DuyHai sounds good! Let's address those problems separately from this ticket since it looks like they might be unrelated.
        Hide
        xedin Pavel Yaskevich added a comment -

        Committed.

        Show
        xedin Pavel Yaskevich added a comment - Committed.
        Hide
        jbellis Jonathan Ellis added a comment -

        (CASSANDRA-11130 addresses my concern about overloading the equals operator. Thanks!)

        Show
        jbellis Jonathan Ellis added a comment - ( CASSANDRA-11130 addresses my concern about overloading the equals operator. Thanks!)
        Hide
        omichallat Olivier Michallat added a comment -

        It's not possible to bind LIKE's argument in a prepared statement (the grammar requires a string literal). Is this an oversight or do we have any reason not to allow it?

        Show
        omichallat Olivier Michallat added a comment - It's not possible to bind LIKE's argument in a prepared statement (the grammar requires a string literal). Is this an oversight or do we have any reason not to allow it?
        Hide
        xedin Pavel Yaskevich added a comment -

        Olivier Michallat I think it might be either depending on what you are trying to do because "%" are required but have to be user provided.

        Show
        xedin Pavel Yaskevich added a comment - Olivier Michallat I think it might be either depending on what you are trying to do because "%" are required but have to be user provided.
        Hide
        omichallat Olivier Michallat added a comment -

        Using the Java driver for example:

        PreparedStatement pst = session.prepare("select * from test.users where first_name LIKE ?");
        BoundStatement bs = pst.bind("Jon%");
        

        The first line fails with SyntaxError: line 1:47 mismatched input '?' expecting STRING_LITERAL (which makes sense since it's how it's declared in the grammar). Other operators declare the right-hand side value as a Term.Raw, which can also be a bind marker.

        I think users will expect to be able to bind the argument this way.

        Show
        omichallat Olivier Michallat added a comment - Using the Java driver for example: PreparedStatement pst = session.prepare( "select * from test.users where first_name LIKE ?" ); BoundStatement bs = pst.bind( "Jon%" ); The first line fails with SyntaxError: line 1:47 mismatched input '?' expecting STRING_LITERAL (which makes sense since it's how it's declared in the grammar). Other operators declare the right-hand side value as a Term.Raw , which can also be a bind marker. I think users will expect to be able to bind the argument this way.
        Hide
        xedin Pavel Yaskevich added a comment -

        Definitely sounds like a bug, I've created CASSANDRA-11456 to track that.

        Show
        xedin Pavel Yaskevich added a comment - Definitely sounds like a bug, I've created CASSANDRA-11456 to track that.
        Hide
        dbrosius Dave Brosius added a comment - - edited

        Pavel Yaskevich

        nitpick

        PrefixTermTree.search(Expression e)

        sanity checks for e == null, but then dereferences it anyway, here
        super.search(e)

        maybe just return emptySet() on e == null

        Show
        dbrosius Dave Brosius added a comment - - edited Pavel Yaskevich nitpick PrefixTermTree.search(Expression e) sanity checks for e == null, but then dereferences it anyway, here super.search(e) maybe just return emptySet() on e == null
        Hide
        xedin Pavel Yaskevich added a comment -

        Thanks for noticing that, Dave Brosius and sorry for late response, I just got to this!.. Alex Petrov Can you please include this as part of CASSANDRA-11990?

        Show
        xedin Pavel Yaskevich added a comment - Thanks for noticing that, Dave Brosius and sorry for late response, I just got to this!.. Alex Petrov Can you please include this as part of CASSANDRA-11990 ?
        Hide
        ifesdjeen Alex Petrov added a comment -

        @xedin sure, did that.

        Show
        ifesdjeen Alex Petrov added a comment - @xedin sure, did that.
        Hide
        xedin Pavel Yaskevich added a comment -

        Thanks!

        Show
        xedin Pavel Yaskevich added a comment - Thanks!

          People

          • Assignee:
            xedin Pavel Yaskevich
            Reporter:
            jbellis Jonathan Ellis
            Reviewer:
            Sam Tunnicliffe
          • Votes:
            2 Vote for this issue
            Watchers:
            16 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development