Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Core
    • Labels:

      Description

      Secondary indexes (of type KEYS) suffer from a number of limitations in their current form:

      • Multiple IndexClauses only work when there is a subset of rows under the highest clause
      • One new column family is created per index; this means 10 new CFs for 10 secondary indexes

      This ticket will use the Lucene library to implement secondary indexes as one index per CF, and utilize the Lucene query engine to handle multiple index clauses. Also, by using Lucene we get a highly optimized file format.

      There are a few parallels we can draw between Cassandra and Lucene.

      Lucene builds index segments in memory, then flushes them to disk, so we can sync our memtable flushes to Lucene flushes. Lucene also has optimize(), which correlates to our compaction process, so these can be sync'd as well.

      We will also need to correlate column validators to Lucene tokenizers so the data can be stored properly. The big win: once this is done, we can perform complex queries within a column, like wildcard searches.

      The downside of this approach is that we will need to read before write, since Lucene documents are written as complete documents. For random workloads with lots of indexed columns, this means we need to read the document from the index, update it, and write it back.
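
      The read-update-write cycle described above can be sketched in plain Java. This is only an illustration: the index is stood in by a map, and the real implementation would use Lucene's IndexWriter.updateDocument(Term, Document) as noted in the comments.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the read-before-write cycle: Lucene replaces whole documents,
// so a single-column update must reconstruct the full document first.
class ReadBeforeWrite {
    // Stand-in for the Lucene index: row key -> document (field -> value).
    static final Map<String, Map<String, String>> index = new HashMap<>();

    static void updateColumn(String rowKey, String column, String value) {
        // 1. Read the existing document back from the index (the extra read).
        Map<String, String> doc = index.getOrDefault(rowKey, new HashMap<>());
        // 2. Apply the single-column mutation to the full document.
        doc.put(column, value);
        // 3. Write the whole document back; with Lucene this would be
        //    writer.updateDocument(new Term("key", rowKey), doc).
        index.put(rowKey, doc);
    }
}
```

      The cost is the per-update read in step 1, which is what the discussion below tries to avoid.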

        Issue Links

          Activity

          Jason Rutherglen added a comment -

          Jake, this looks good. We need to specify how configuration parameters are passed into the Lucene secondary index. This needs to include things like the local Lucene file path, a class to transform Cassandra CF rows into Lucene documents, etc.

          Jason Rutherglen added a comment -

          This will be very similar to what's being added to HBase. We can borrow some design techniques etc.

          Jason Rutherglen added a comment -

          Realtime search will benefit the indexing speed of the Cassandra secondary index.

          Jonathan Ellis added a comment -

          For random workloads with lots of indexed columns, this means we need to read the document from the index, update it, and write it back.

          Could we go for a deeper level of integration? Instead of storing the data twice as Cassandra row + Lucene document, use the row as the document Source Of Truth, and just let Lucene handle the indexes?

          T Jake Luciani added a comment -

          We need to specify how configuration parameters are passed into the Lucene secondary index. This needs to include things like the local Lucene file path, a class to transform Cassandra CF rows into Lucene documents, etc.

          The secondary indexes would go into the data directory defined in cassandra.yaml. Currently there is a dir per keyspace; we can create a subdir like "indexes" where the Lucene indexes are stored.

          As for transforms, I mentioned column validators. This is meta information about the contents of columns, see http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes

          This validation_class can be extended to let users map columns to a Lucene analyzer.

          The document would be a row; fields would be columns (with analyzers specified in the column metadata validation_class).

          T Jake Luciani added a comment -

          Could we go for a deeper level of integration? Instead of storing the data twice as Cassandra row + Lucene document, use the row as the document Source Of Truth, and just let Lucene handle the indexes?

          Sure, but it still requires constructing the full row before writing it to the index, since the client may be updating only field 1 while indexes are on field 1 and field 2.

          Jonathan Ellis added a comment -

          Right. I didn't mean to imply this solves read-before-write, only that I'd like to avoid writing two copies of the base data.

          Jason Rutherglen added a comment -

          I'd like to avoid writing two copies of the base data

          Cassandra only needs to store the row UID as a Lucene document.

          Jason Rutherglen added a comment -

          Does Cassandra have a built-in RPC mechanism we can use to send the [Lucene] queries to the distributed servers?

          Jonathan Ellis added a comment -

          Yes. Look at uses of MessagingService.

          Jason Rutherglen added a comment -

          I looked at MessagingService which seems to be more [custom] asynchronous?

          I think we could offer a Thrift API? What does CQL use?

          I think we'd want to look towards making this [Lucene] play well / integrate with CQL?

          Todd Nine added a comment - - edited

          Hey guys. We're doing something similar in the hector JPA plugin.

          Would using dynamic composites within Cassandra alleviate the need for Lucene documents? We're using this in secondary indexing, and it gives us order-by semantics and AND (intersection). The largest issue becomes iteration with OR clauses: AND clauses can be compressed into a single column for efficient range scans, and we then use iterators to union the OR trees together with order clauses in the composites. The caveat is that the user must define indexes with order semantics up front. However, this can easily be added to the existing secondary indexing clauses.

          Jason Rutherglen added a comment -

          Would using dynamic composites within cassandra alleviate the need for Lucene documents?

          I think it is hard to duplicate the efficiency of Lucene for dis/conjunction queries (OR / AND), especially with PFOR (patched frame-of-reference, a CPU-friendly scheme for decoding compressed integers on today's microprocessors) implemented.

          We can/will turn off scoring, which makes Lucene more of a straight query execution engine than a free-text search engine. Range queries in Lucene use a trie scheme, which is highly efficient.

          Jason Rutherglen added a comment -

          I think the open design question on this one is distributed search: how will a distributed search client know which Cassandra servers to send a query to? Traditionally a query is sent to N servers, whose responses are merged, and X results are returned. We can send a query to all servers; however, I think we'd then have duplicate rows/documents returned. How does CQL handle this?

          Todd Nine added a comment -

          I'm quite keen to contribute on this issue, as this will greatly enhance the functionality of the hector-jpa project. If I can contribute any work, please let me know.

          T Jake Luciani added a comment -

          Todd: once CASSANDRA-2982 is done we can get started. I'm trying to focus on that right now. In the meantime we need to think about how to link Lucene analyzers to column_metadata.

          Jason: This currently works by executing the query locally; if that does not return enough results, it moves on to the next node. Since the ring is split, we know the range of keys to restrict the search to. This avoids dups.

          Jason Rutherglen added a comment -

          This currently works by executing the query locally if that does not have enough results it moves on to the next node.

          Ok. Typically in distributed search one needs/wants to send the request to all of the possible nodes that contain data pertinent to the query. Is this possible?

          In the meantime we need to think of how to link lucene analyzers to column_metadata

          Can we simply define a class that intercepts row updates for a column family? Then that class can implement what is needed to analyze the columns / row?

          T Jake Luciani added a comment -

          Ok. Typically in distributed search one needs/wants to send the request to all of the possible nodes that contain data pertinent to the query. Is this possible?

          See CASSANDRA-1337; it's always going to need to hit all the nodes in the worst case (or if we add support for ORDER BY in CQL).

          Can we simply define a class that intercepts row updates for a column family? Then that class can implement what is needed to analyze the columns / row?

          The problem is that the Type class can be user-defined, so this doesn't get us very far. I was thinking we add a new method to the AbstractType class that can be set, like getLuceneAnalyzer().
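
          The idea could look something like the sketch below. All names here are illustrative rather than Cassandra's actual API (the real AbstractType lives in org.apache.cassandra.db.marshal, and the hook would likely return a Lucene Analyzer instance rather than a class name):

```java
// Sketch: extend the validator hierarchy with an overridable hook that names
// the analyzer to use when indexing values of this type. Hypothetical names,
// not Cassandra's real AbstractType.
abstract class AbstractTypeSketch {
    public abstract String getString(byte[] bytes);

    // Default: index values verbatim; user-defined types may override.
    public String getLuceneAnalyzer() {
        return "org.apache.lucene.analysis.KeywordAnalyzer";
    }
}

class UTF8TypeSketch extends AbstractTypeSketch {
    @Override public String getString(byte[] bytes) {
        return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);
    }

    // Free-text columns might opt into tokenization.
    @Override public String getLuceneAnalyzer() {
        return "org.apache.lucene.analysis.standard.StandardAnalyzer";
    }
}
```

          The point of putting the hook on the type is that user-defined validators inherit a sane default and can opt in to richer analysis.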

          Jason Rutherglen added a comment -

          like getLuceneAnalyzer()

          There won't always be a 1-to-1 mapping of a column to a field. For example, Solr has copyField, which essentially creates a new field. Also, Analyzer is for any field; the right per-field class would be Tokenizer.

          I strongly believe we need to have an interface that accepts a row and essentially generates a Lucene Document. This should be the most straightforward approach that enables just about anything, including using a Solr schema at some point.
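
          Such an interface could be as small as the sketch below (names hypothetical; the document is stood in by a field-to-value map instead of org.apache.lucene.document.Document):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a pluggable row-to-document transformer. Because the transformer
// sees the whole row, one column can fan out to several fields (copy-field
// style), be rewritten, or be skipped entirely.
interface RowToDocument {
    Map<String, String> toDocument(String rowKey, Map<String, String> columns);
}

// Trivial default: every column becomes a field, plus a stored row key.
class DefaultRowToDocument implements RowToDocument {
    public Map<String, String> toDocument(String rowKey, Map<String, String> columns) {
        Map<String, String> doc = new HashMap<>(columns);
        doc.put("key", rowKey); // the row UID becomes a stored field
        return doc;
    }
}
```

          A Solr-schema-driven implementation would just be another class behind the same interface.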

          Todd Nine added a comment -

          A couple questions.

          1. Will read-after-write be available? I.e., if your mutation for the row key returns to the client, the row now has an entry in the Lucene index, which can immediately be queried to return the results.

          2. What about durability? In the event Cassandra crashes, will the Lucene index retain these indexed values, or will they be lost if commit is not invoked on the index?

          T Jake Luciani added a comment -

          Will read-after-write be available? I.e., if your mutation for the row key returns to the client, the row now has an entry in the Lucene index, which can immediately be queried to return the results.

          Yes. We can use a RAMDirectory() to keep writes real-time.

          What about durability? In the event Cassandra crashes, will the Lucene index retain these indexed values, or will they be lost if commit is not invoked on the index?

          When the memtable is flushed, we will merge the RAMDirectory index into the FSDirectory index and call reopen().
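
          A minimal sketch of that flush path, with the two directories stood in by lists (the real calls would be along the lines of IndexWriter.addIndexes(ramDir), IndexWriter.commit(), and IndexReader.reopen()):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: recent writes buffer in a RAM index; on memtable flush they are
// merged into the durable on-disk index, mirroring Cassandra's own
// memtable -> SSTable lifecycle.
class FlushSync {
    final List<String> ramSegments = new ArrayList<>(); // RAMDirectory stand-in
    final List<String> fsSegments = new ArrayList<>();  // FSDirectory stand-in

    void write(String doc) {
        ramSegments.add(doc); // immediately searchable, but volatile
    }

    // Called when the memtable flushes, so index durability tracks SSTables.
    void onMemtableFlush() {
        fsSegments.addAll(ramSegments); // IndexWriter.addIndexes + commit
        ramSegments.clear();
        // IndexReader.reopen() would then expose the merged segments to readers
    }
}
```

          Anything still in the RAM stage at crash time would be rebuilt from the commit log replay, the same way memtable contents are.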

          Jason Rutherglen added a comment -

          Yes. We can use a RAMDirectory() to keep writes real-time.

          LUCENE-3092 implemented NRTCachingDirectory, which we can use for in-RAM NRT until LUCENE-2312 is completed.

          T Jake Luciani added a comment -

          LUCENE-2454 adds support for nested documents; we can perhaps use this to avoid the read-before-write. We could create a document per field and nest them together under a row-level parent doc.

          T Jake Luciani added a comment -

          Another issue we need to work around is expiring columns. We could store the expiration time in the document and make it a constraint on the Lucene query so we don't pull expired data.
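
          The expiration constraint might look like the sketch below, with the candidate hits stood in by a list (a real implementation would attach a numeric range clause on the expiration field to every Lucene query rather than post-filtering):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: each document carries its expiration timestamp; every query is
// implicitly AND'ed with "expiresAt > now" so expired rows never surface.
class ExpirationFilter {
    static class Doc {
        final String key;
        final long expiresAt; // Long.MAX_VALUE means "never expires"
        Doc(String key, long expiresAt) { this.key = key; this.expiresAt = expiresAt; }
    }

    static List<Doc> search(List<Doc> hits, long now) {
        List<Doc> live = new ArrayList<>();
        for (Doc d : hits) {
            if (d.expiresAt > now) live.add(d); // range-clause stand-in
        }
        return live;
    }
}
```

          Expired documents would still be physically removed later, when the memtable flush or compaction rewrites the index.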

          Jason Rutherglen added a comment -

          LUCENE-2454 adds support for nested documents; we can perhaps use this to avoid the read-before-write

          I think LUCENE-2454 needs the nested documents to be added at the same time; in our case that wouldn't be happening. Google's GData, for example, doesn't offer the feature of automatically retrieving values from the previous document; it assumes you are replacing the entire document with new contents, and relies on the user having read the document [somewhere] before.

          I think there's another Lucene issue that performs an initial query to obtain the parent document. However, that is the same as a read before write.

          I'm guessing Cassandra enables updating an individual column? I don't think there's any way around this?

          We could store the expiration time in the document and make it a constraint on the lucene query so we don't pull expired data

          That would work. We'd need to use a trie range filter query, which will make all queries a little bit slower.

          Jason Rutherglen added a comment -

          I think it's important to note all of the many SQL-like features Lucene has [now].

          ORDER BY, GROUP BY, COUNT / facet, AND / OR queries, LIKE. This makes Lucene ideal for CQL and its goals.

          Ryan King added a comment -

          Regarding realtime search, hasn't our (Twitter's) realtime search branch been merged into Lucene trunk? Whenever that's available we should get real realtime results.

          Jason Rutherglen added a comment -

          Regarding realtime search, hasn't our (Twitter's) realtime search branch been merged into Lucene trunk?

          There's LUCENE-2312. Twitter's RT search is highly specialized (yes, I'm familiar with it); Lucene is far too general (think of payloads, phrase queries, span queries, etc.) for Twitter's code to be merged into it. If Twitter's search were to be integrated, there would be an awful lot of refactoring of Lucene required.

          Jason Rutherglen added a comment -

          In which physical directory do we want to place the Lucene indexes?

          T Jake Luciani added a comment -

          Under the CF dir I imagine

          Todd Nine added a comment - - edited

          I don't necessarily think there is a 1-to-1 relationship between a column and a Lucene document field. In our case we need to index fields in more than one manner. For instance, we index users as straight strings (lowercased) for the email, first name, and last name columns. However, we also want to tokenize the email, first name, and last name columns to allow our customer support people to perform partial name matching. I think a 1-to-N mapping from column to document field is required to allow this sort of functionality.
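
          That 1-to-N fan-out can be sketched as follows (plain Java illustration; a real version would emit Lucene Fields with per-field analyzers instead of name/value string pairs, and the field names here are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: one column ("email") fans out to two kinds of index fields — one
// exact (lowercased, untokenized) for equality lookups, and one tokenized
// for partial matching.
class ColumnFanOut {
    static List<String[]> indexEmail(String value) {
        List<String[]> fields = new ArrayList<>();
        // Exact field: the whole value, lowercased.
        fields.add(new String[] {"email_exact", value.toLowerCase()});
        // Tokenized field: split on non-word characters for partial matches.
        for (String token : value.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) fields.add(new String[] {"email_token", token});
        }
        return fields;
    }
}
```

          Equality queries would hit email_exact, while partial-name lookups by support staff would hit email_token.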

          As far as expiration on columns, is there a system event we can hook into to just force a document reindex when a column expires, rather than adding an additional field that queries will need to filter on?

          As per Jason's previous post, I think supporting ORDER BY, GROUP BY, COUNT, LIKE, etc. is a must. Most users have become accustomed to this functionality from an RDBMS. If these cause potential performance problems, that should be documented so users have enough information to determine whether they can rely on the Lucene index or should build their own index directly.

          Has anyone looked at existing code in ElasticSearch to avoid some of the pitfalls they have already experienced in building something similar?

          http://www.elasticsearch.org/

          Lastly, this is a huge feature for the hector-jpa plugin, what can I do to help?

          Todd Nine added a comment -

          Could we also use this feature as a standard way of building our Lucene documents? This would accomplish what we want, as well as giving a hook for more user functionality.

          CASSANDRA-1311

          Jason Rutherglen added a comment -

          Todd,

          Another option is to add a [user optional] class that converts raw Cassandra columns into a Lucene document. Implicitly, the Cassandra columns would not need to map to Lucene document fields. This is more of a slight change in the user's expectations for CQL than a core functional change. E.g., the CQL submitted to a Lucene secondary index may refer to Lucene fields that do not exist as columns.

          Ed Anuff added a comment -

          +1 on having the ability to provide a conversion class for handling transformations from columns to Lucene documents. It's not uncommon for people to store objects serialized to JSON or some other serialization format in columns. CQL will have to catch up with this practice at some point.

          Todd Nine added a comment - - edited

          I think forcing users to install classes for common use cases would hurt adoption. What about creating new CQL commands to handle this? When creating an index in a DB, you define the fields and the manner in which they are indexed. Could we do something like the following?

          create index on [colname] in [colfamily] using [index type 1] as [indexFieldName], [index type 2] as [indexFieldName], [index type n] as [indexFieldName]?

          drop index [indexFieldName] in [colfamily] on [colname]

          This way clients such as JPA can update and create indexes, without the need to install custom classes on Cassandra itself. They also have the ability to directly reference the field name when using CQL queries.

          Assuming that the index class types exist in the Lucene classpath, you get the 1 to many mappings for column to indexing strategy. This would allow more advanced clients such as the JPA plugin to automatically add indexes to the document based on indexes defined on persistent fields, without generating any code the user has to install in the Cassandra runtime. If users want to install custom analyzers, they still have the option to do so, and would gain access to it via CQL.

          Show
          Todd Nine added a comment - - edited I think forcing users to install classes for common use cases would cause issues with adoption. What about creating new CQL commands to handle this? When creating an index in a db, you would define the fields and the manner in which they are indexed. Could we do something like the following? create index on [colname] in [colfamily] using [index type 1] as [indexFieldName] , [index type 2] as [indexFieldName] , [index type n] as [indexFieldName] ? drop index [indexFieldName] in [colfamily] on [colname] This way clients such as JPA can update and create indexes, without the need to install custom classes on Cassandra itself. They also have the ability to directly reference the field name when using CQL queries. Assuming that the index class types exist in the Lucene classpath, you get the 1 to many mappings for column to indexing strategy. This would allow more advanced clients such as the JPA plugin to automatically add indexes to the document based on indexes defined on persistent fields, without generating any code the user has to install in the Cassandra runtime. If users want to install custom analyzers, they still have the option to do so, and would gain access to it via CQL.
          T Jake Luciani added a comment -

          "I think supporting ORDER BY, GROUP BY, COUNT, LIKE etc are a must."

          I don't think GROUP BY and ORDER BY are something we want to support via secondary indexes. The scatter-gather this would require across the Cassandra cluster would be a performance killer and would promote bad data-modeling practices.

          The goal of this ticket is to support Lucene search features with the current secondary index API.

          We can add LIKE, OR, NOT, BETWEEN with this.
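          As a sketch of why these operators fall out naturally, each has a direct counterpart in Lucene's query syntax and query classes. The field names below are invented for illustration; the right-hand side uses standard Lucene query-parser syntax and the corresponding query classes.

          ```text
          LIKE 'jo%'             ->  name:jo*         (PrefixQuery / WildcardQuery)
          a = 1 OR b = 2         ->  a:1 b:2          (BooleanQuery with SHOULD clauses)
          NOT state = 'CA'       ->  *:* -state:CA    (BooleanQuery with a MUST_NOT clause)
          age BETWEEN 18 AND 30  ->  age:[18 TO 30]   (TermRangeQuery / NumericRangeQuery)
          ```

          Translating an IndexExpression list into one BooleanQuery would also subsume today's multiple-IndexClause handling.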

          Todd Nine added a comment -

          I agree that ORDER BY could be a performance killer for large data sets. There, users should denormalize and maintain their own secondary index for efficient querying. On small data sets, however, which seem to be very common in web systems (about 80% of the data a user sees, in our case), ORDER BY semantics are very important. Most of the data our users see comes in very small result sets, under 100 rows. I think explicitly prohibiting these features limits the user too much. Shouldn't they be supported, leaving it up to the user to decide which approach to take when implementing indexes for their data?

          Todd Nine added a comment -

          If we want GROUP BY semantics, this will need to be done first.

          Camille Vergara added a comment -

          If you're interested in seeing this feature implemented, you should consider supporting the fundraiser for bounties on Bountysource: https://www.bountysource.com/fundraisers/508.

          Alex Liu added a comment -

          We may need to use Twitter's real-time search.

          Matt Stump added a comment -

          Given that the read-before-write issue still stands for non-numeric fields (as of Lucene 4.6), are Lucene-based secondary indexes still something we want committed in the near term? Do we want to wait until incremental updates/stacked segments are available for all field types?

          Additionally, even with near-real-time search, Lucene imposes a delay between when a row is added and when it is queryable, which would differ from existing behavior; is this something we can live with?


            People

            • Assignee:
              Unassigned
              Reporter:
              T Jake Luciani
            • Votes:
              35
              Watchers:
              54

              Dates

              • Created:
                Updated:
