Details

      Description

      There are several projects using custom secondary indexes as an extension point to integrate C* with other systems such as Solr or Lucene. The usual approach is to embed third party indexing queries in CQL clauses.

      For example, DSE Search embeds Solr syntax this way:

      SELECT title FROM solr WHERE solr_query='title:natio*';
      

      Stratio platform embeds custom JSON syntax for searching in Lucene indexes:

      SELECT * FROM tweets WHERE lucene='{
          filter : {
              type: "range",
              field: "time",
              lower: "2014/04/25",
              upper: "2014/04/1"
          },
          query  : {
              type: "phrase", 
              field: "body", 
              values: ["big", "data"]
          },
          sort  : {fields: [ {field:"time", reverse:true} ] }
      }';
      

      Tuplejump Stargate also uses the Stratio's open source JSON syntax:

      SELECT name,company FROM PERSON WHERE stargate ='{
          filter: {
              type: "range",
              field: "company",
              lower: "a",
              upper: "p"
          },
          sort:{
             fields: [{field:"name",reverse:true}]
          }
      }';
      

      These syntaxes are validated by the corresponding 2i implementation. This validation is done behind the StorageProxy command distribution. So, far as I know, there is no way to give rich feedback about syntax errors to CQL users.

      I'm uploading a patch with some changes trying to improve this. I propose adding an empty validation method to SecondaryIndexSearcher that can be overridden by custom 2i implementations:

      public void validate(List<IndexExpression> clause) {}
      

      And call it from SelectStatement#getRangeCommand:

      ColumnFamilyStore cfs = Keyspace.open(keyspace()).getColumnFamilyStore(columnFamily());
              for (SecondaryIndexSearcher searcher : cfs.indexManager.getIndexSearchersForQuery(expressions))
              {
      
                  try
                  {
                      searcher.validate(expressions);
                  }
                  catch (RuntimeException e)
                  {
                      String exceptionMessage = e.getMessage();
                      if (exceptionMessage != null 
                              && !exceptionMessage.trim().isEmpty())
                          throw new InvalidRequestException(
                                  "Invalid index expression: " + e.getMessage());
                      else
                          throw new InvalidRequestException(
                                  "Invalid index expression");
                  }
              }
      

      In this way C* allows custom 2i implementations to give feedback about syntax errors.

      We are currently using these changes in a fork with no problems.

      1. 2i_validation_v2.patch
        5 kB
        Andrés de la Peña
      2. 2i_validation_v3.patch
        5 kB
        Andrés de la Peña
      3. 2i_validation_v4.patch
        12 kB
        Sergio Bossa
      4. 2i_validation.patch
        3 kB
        Andrés de la Peña

        Issue Links

          Activity

          Hide
          mshuler Michael Shuler added a comment -

          Jonathan Ellis do you think this feature should this be targeted for 2.1.x or 3.0?

          Show
          mshuler Michael Shuler added a comment - Jonathan Ellis do you think this feature should this be targeted for 2.1.x or 3.0?
          Hide
          jbellis Jonathan Ellis added a comment -

          I've never been a fan of the WHERE specialcolumn = X approach to this problem. IMO the "right" solution would use UDF:

          SELECT * FROM tweets 
          WHERE between(time, '2014/04/25', '2014/04/1') AND phrase_contains({'big', 'data'}, body)
          ORDER BY time desc
          

          /cc Sergio Bossa T Jake Luciani

          Show
          jbellis Jonathan Ellis added a comment - I've never been a fan of the WHERE specialcolumn = X approach to this problem. IMO the "right" solution would use UDF: SELECT * FROM tweets WHERE between(time, '2014/04/25', '2014/04/1') AND phrase_contains({'big', 'data'}, body) ORDER BY time desc /cc Sergio Bossa T Jake Luciani
          Hide
          tjake T Jake Luciani added a comment -

          I agree this special column thing was/is a hack.

          Ideally we would support a richer syntax in cql, not sure UDFs are much better. This patch does at least give us a way to notify the user currently the syntax is invalid rather than having to search the logfile for exceptions.

          Show
          tjake T Jake Luciani added a comment - I agree this special column thing was/is a hack. Ideally we would support a richer syntax in cql, not sure UDFs are much better. This patch does at least give us a way to notify the user currently the syntax is invalid rather than having to search the logfile for exceptions.
          Hide
          jbellis Jonathan Ellis added a comment -

          Could we support raw lucene, solr, and elasticsearch approaches with one set of syntax? I suspect we cannot, so I lean towards UDF as the more flexible approach.

          (I'm also -0 on shipping any of those 3 with C* out of the box, and it would be kind of weird to support it syntactically but not ship an implementation.)

          Show
          jbellis Jonathan Ellis added a comment - Could we support raw lucene, solr, and elasticsearch approaches with one set of syntax? I suspect we cannot, so I lean towards UDF as the more flexible approach. (I'm also -0 on shipping any of those 3 with C* out of the box, and it would be kind of weird to support it syntactically but not ship an implementation.)
          Hide
          tjake T Jake Luciani added a comment -

          Well in your specific case if we supported BETWEEN and LIKE in CQL it would map well but yeah there are prob a lot of other things. You could accomplish the same thing we currently do with a single udf like search('title:foo*')

          Show
          tjake T Jake Luciani added a comment - Well in your specific case if we supported BETWEEN and LIKE in CQL it would map well but yeah there are prob a lot of other things. You could accomplish the same thing we currently do with a single udf like search('title:foo*')
          Hide
          adelapena Andrés de la Peña added a comment -

          The lucene/solr index is created with the CQL's create index statement:

          CREATE CUSTOM INDEX IF NOT EXISTS users_index 
          ON tweets (lucene) 
          USING '<custom_index_class>'
          WITH OPTIONS = {<indexing_options_and_schema>};
          

          With your approach, should we previously create the UDF in addition to the 2i? How would we connect the UDF to the secondary index?

          I'm aware that the special column is a hack, but it makes the work and it's so flexible that is being used successfully for a variety of queries such as indexing systems.

          In addition, shipping UDFs independently of the indexes would be a nice feature

          Show
          adelapena Andrés de la Peña added a comment - The lucene/solr index is created with the CQL's create index statement: CREATE CUSTOM INDEX IF NOT EXISTS users_index ON tweets (lucene) USING '<custom_index_class>' WITH OPTIONS = {<indexing_options_and_schema>}; With your approach, should we previously create the UDF in addition to the 2i? How would we connect the UDF to the secondary index? I'm aware that the special column is a hack, but it makes the work and it's so flexible that is being used successfully for a variety of queries such as indexing systems. In addition, shipping UDFs independently of the indexes would be a nice feature
          Hide
          sbtourist Sergio Bossa added a comment -

          I see your point Jonathan Ellis, but agree with Andrés de la Peña: even if it feels like an hack, the special column works and is flexible enough to accommodate disparate implementations; also, improving it has very low implementation cost (it could also be retrofitted to 2.0.x), and doesn't prevent future implementations based on UDFs.

          Show
          sbtourist Sergio Bossa added a comment - I see your point Jonathan Ellis , but agree with Andrés de la Peña : even if it feels like an hack, the special column works and is flexible enough to accommodate disparate implementations; also, improving it has very low implementation cost (it could also be retrofitted to 2.0.x), and doesn't prevent future implementations based on UDFs.
          Hide
          jbellis Jonathan Ellis added a comment -

          All right. Sounds like I'm in the minority here, and for a relatively uninvasive patch I'm willing to bend. Sergio Bossa to review.

          Show
          jbellis Jonathan Ellis added a comment - All right. Sounds like I'm in the minority here, and for a relatively uninvasive patch I'm willing to bend. Sergio Bossa to review.
          Hide
          sbtourist Sergio Bossa added a comment -

          Andrés de la Peña, which C* branch is your patch targeted at?

          Show
          sbtourist Sergio Bossa added a comment - Andrés de la Peña , which C* branch is your patch targeted at?
          Hide
          adelapena Andrés de la Peña added a comment - - edited

          Nice. It's targeted for 2.1.x and it should work in 3.0 too.

          Show
          adelapena Andrés de la Peña added a comment - - edited Nice. It's targeted for 2.1.x and it should work in 3.0 too.
          Hide
          snazy Robert Stupp added a comment -

          The custom column solution is a bit dirty by that you query column 'foo' for something that is in column 'bar'.
          I think we can add support for 2i in UDFs (as an extension to UDFs - not with the basics covered by CASSANDRA-7395) - maybe by providing some information like table/row/column meta data to the UDF information.
          UDFs could also help to cover "complex" queries against lucene/solr/elasticsearch (conditional filter/query against multiple fields, highlighting, etc). For example with something like this to get every row with a full-text-search score > .75. But that's stuff for a separate ticket - the UDF impl would then return a set of primary keys - so it's a "query rewrite" behind the scenes. But the syntax is much more obvious.

          SELECT * FROM my_super_table WHERE elasticsearch('{
              filter: {
                  type: "range",
                  field: "company",
                  lower: "a",
                  upper: "p"
              },
              sort:{
                 fields: [{field:"name",reverse:true}]
              }
          }') > 0.75
          
          Show
          snazy Robert Stupp added a comment - The custom column solution is a bit dirty by that you query column 'foo' for something that is in column 'bar'. I think we can add support for 2i in UDFs (as an extension to UDFs - not with the basics covered by CASSANDRA-7395 ) - maybe by providing some information like table/row/column meta data to the UDF information. UDFs could also help to cover "complex" queries against lucene/solr/elasticsearch (conditional filter/query against multiple fields, highlighting, etc). For example with something like this to get every row with a full-text-search score > .75. But that's stuff for a separate ticket - the UDF impl would then return a set of primary keys - so it's a "query rewrite" behind the scenes. But the syntax is much more obvious. SELECT * FROM my_super_table WHERE elasticsearch('{ filter: { type: "range", field: "company", lower: "a", upper: "p" }, sort:{ fields: [{field:"name",reverse:true}] } }') > 0.75
          Hide
          sbtourist Sergio Bossa added a comment -

          Andrés de la Peña, following the review of your patch, I believe that while it works in practice, pulling the index searchers in SelectStatement#getRangeCommand and validating them that way is a bit odd, more specifically:

          • SelectStatement#getRangeCommand may be called even if a 2i query is not present, so enforcing 2i validation there is a bit misleading and unexpected.
          • SecondaryIndexSearcher#validate is called with the whole list of index expressions, which means each searcher implementation will have to go through the list to inspect each expression and decide if that specific expression was targeted for it and is wrong, or was just for another searcher.

          I'd rather rework the patch in the following way:

          • Add a SecondaryIndexManager#validateIndexSearchersForQuery method that works similarly to getIndexSearchersForQuery, but rather than just getting the index by each column, it also validates it against the proper column/expression by calling SecondaryIndexSearcher#validate(IndexExpression).
          • Call SecondaryIndexManager#validateIndexSearchersForQuery from SelectStatement#RawStatement#validateSecondaryIndexSelections

          That should improve encapsulation and responsibility placement and provide better 2i APIs.

          Finally, I would add a few tests.

          Show
          sbtourist Sergio Bossa added a comment - Andrés de la Peña , following the review of your patch, I believe that while it works in practice, pulling the index searchers in SelectStatement#getRangeCommand and validating them that way is a bit odd, more specifically: SelectStatement#getRangeCommand may be called even if a 2i query is not present, so enforcing 2i validation there is a bit misleading and unexpected. SecondaryIndexSearcher#validate is called with the whole list of index expressions, which means each searcher implementation will have to go through the list to inspect each expression and decide if that specific expression was targeted for it and is wrong, or was just for another searcher. I'd rather rework the patch in the following way: Add a SecondaryIndexManager#validateIndexSearchersForQuery method that works similarly to getIndexSearchersForQuery, but rather than just getting the index by each column, it also validates it against the proper column/expression by calling SecondaryIndexSearcher#validate(IndexExpression). Call SecondaryIndexManager#validateIndexSearchersForQuery from SelectStatement#RawStatement#validateSecondaryIndexSelections That should improve encapsulation and responsibility placement and provide better 2i APIs. Finally, I would add a few tests.
          Hide
          adelapena Andrés de la Peña added a comment -

          Sergio Bossa, here is the patch with the suggested changes. We have run some functional tests and it works fine.

          Show
          adelapena Andrés de la Peña added a comment - Sergio Bossa , here is the patch with the suggested changes. We have run some functional tests and it works fine.
          Hide
          sbtourist Sergio Bossa added a comment -

          Andrés de la Peña, the patch doesn't apply cleanly to cassandra-2.1.

          Show
          sbtourist Sergio Bossa added a comment - Andrés de la Peña , the patch doesn't apply cleanly to cassandra-2.1.
          Hide
          adelapena Andrés de la Peña added a comment -

          Sergio Bossa, I suppose that the problem is due to trailing white spaces in the patch file. I'm uploading a new version without trailing whitespaces. These are the steps I've followed to apply the patch without warnings:

          git clone https://github.com/apache/cassandra.git
          git checkout cassandra-2.1
          git apply 2i_validation_v3.patch
          

          Sorry for the inconvenience.

          Show
          adelapena Andrés de la Peña added a comment - Sergio Bossa , I suppose that the problem is due to trailing white spaces in the patch file. I'm uploading a new version without trailing whitespaces. These are the steps I've followed to apply the patch without warnings: git clone https: //github.com/apache/cassandra.git git checkout cassandra-2.1 git apply 2i_validation_v3.patch Sorry for the inconvenience.
          Hide
          sbtourist Sergio Bossa added a comment - - edited

          Andrés de la Peña, I'm attaching a new patch with the following changes:

          1) SecondaryIndexSearcher#validate now throws an InvalidRequestException, so the "validate" contract of throwing an exception in case of failures is built into the API signature.
          2) SecondaryIndexManager#validateIndexSearchersForQuery is now cleaner and less complex.
          3) SecondaryIndexManager#validateIndexSearchersForQuery call has been moved to SelectStatement#getValidatedIndexExpressions (previously SelectStatement#getIndexExpressions), because of broken tests (as index expressions require the proper query options).
          4) Added a new test to PerRowSecondaryIndexTest.

          Please have a look.

          Show
          sbtourist Sergio Bossa added a comment - - edited Andrés de la Peña , I'm attaching a new patch with the following changes: 1) SecondaryIndexSearcher#validate now throws an InvalidRequestException, so the "validate" contract of throwing an exception in case of failures is built into the API signature. 2) SecondaryIndexManager#validateIndexSearchersForQuery is now cleaner and less complex. 3) SecondaryIndexManager#validateIndexSearchersForQuery call has been moved to SelectStatement#getValidatedIndexExpressions (previously SelectStatement#getIndexExpressions), because of broken tests (as index expressions require the proper query options). 4) Added a new test to PerRowSecondaryIndexTest. Please have a look.
          Hide
          adelapena Andrés de la Peña added a comment -

          That's fine with me.

          Show
          adelapena Andrés de la Peña added a comment - That's fine with me.
          Hide
          sbtourist Sergio Bossa added a comment -

          +1 for v4 patch.

          Show
          sbtourist Sergio Bossa added a comment - +1 for v4 patch.
          Hide
          jbellis Jonathan Ellis added a comment -

          committed

          Show
          jbellis Jonathan Ellis added a comment - committed

            People

            • Assignee:
              adelapena Andrés de la Peña
              Reporter:
              adelapena Andrés de la Peña
              Reviewer:
              Sergio Bossa
            • Votes:
              8 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development