  Solr / SOLR-4587

Implement Saved Searches a la ElasticSearch Percolator


      Description

      Use Lucene MemoryIndex for this.
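      To make the description concrete: the basic percolator loop is to index each incoming document into a transient MemoryIndex and run every registered query against it, reporting the ids of the queries that match. Below is a minimal sketch of that loop, assuming Lucene's MemoryIndex and the classic QueryParser (field names, query ids, and query strings are only illustrative; on older Lucene versions the analyzer constructor also takes a Version argument):

      {code:java}
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.index.memory.MemoryIndex;
      import org.apache.lucene.queryparser.classic.QueryParser;
      import org.apache.lucene.search.Query;

      import java.util.LinkedHashMap;
      import java.util.Map;

      public class PercolatorSketch {
          public static void main(String[] args) throws Exception {
              StandardAnalyzer analyzer = new StandardAnalyzer();

              // Saved searches, keyed by the id the client registered them under.
              Map<String, Query> savedQueries = new LinkedHashMap<>();
              QueryParser parser = new QueryParser("body", analyzer);
              savedQueries.put("alert-1", parser.parse("solr AND percolator"));
              savedQueries.put("alert-2", parser.parse("\"saved search\""));

              // Index the single incoming document into a throwaway MemoryIndex ...
              MemoryIndex memIndex = new MemoryIndex();
              memIndex.addField("body", "Solr percolator-style saved search support", analyzer);

              // ... then loop the registered queries against it and report the ones that match.
              for (Map.Entry<String, Query> entry : savedQueries.entrySet()) {
                  if (memIndex.search(entry.getValue()) > 0.0f) {
                      System.out.println("document matches saved query: " + entry.getKey());
                  }
              }
          }
      }
      {code}

      The rest of the thread is largely about how to expose this loop through Solr (query registration, sharding, an HTTP API) and how to avoid running every query against every document.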

        Activity

        Otis Gospodnetic added a comment -

        http://search-lucene.com/m/Solr/eHNlnz4JxwIMSo1?subj=Deep+dive+on+the+topic+streaming+expression for anyone who wants to follow.

        Hemant Purswani added a comment -

        Thank you Joel. I will follow the thread.

        Joel Bernstein added a comment -

        Moving the discussion to the users list.

        The subject will be:

        Deep dive on the topic() streaming expression

        I'll copy your questions above into the first email to the list.

        Hemant Purswani added a comment -

        Hi Joel,

        Yeah, I wasn't sure whether I was supposed to post my questions on this JIRA or on the reference guide, so I ended up posting them on both. Thanks for getting back to me. I have a couple of questions related to your post.

        1) You mentioned that "The issue here is that it's possible that an out of order version number could persist across commits."

        Is the above possible even if I am using optimistic concurrency (http://yonik.com/solr/optimistic-concurrency/) to write documents to Solr? (A sketch of a version-checked update appears after this comment.)

        2) Query subscription is going to be a critical part of my project, and our subscribers can't afford lost alerts. What can I do to make sure there is no loss of alerts? As long as I get an error message whenever there is a failure, I will make sure that my system retries/replays indexing that specific document.

        3) Do you happen to have any stats about the possibility of data loss in Solr? How often does that happen? Are there any best practices we can follow to avoid it?

        4) In general, are streaming expressions robust enough to be used in production?

        5) Is there any more deep-dive documentation about topic()? I would love to see stats for a query volume as big as ours (9-10 million), or to understand how it works internally.

        Thanks again,

        Hemant
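        For reference, optimistic concurrency here means sending the document's last known _version_ with the update; Solr rejects the update with an HTTP 409 conflict if the stored version differs. A minimal SolrJ sketch, assuming a node at localhost:8983 and an "alerts" core (both illustrative); the exact exception type surfaced for the conflict depends on the client used:

        {code:java}
        import org.apache.solr.client.solrj.impl.HttpSolrClient;
        import org.apache.solr.common.SolrException;
        import org.apache.solr.common.SolrInputDocument;

        public class OptimisticUpdate {
            public static void main(String[] args) throws Exception {
                try (HttpSolrClient client =
                         new HttpSolrClient.Builder("http://localhost:8983/solr/alerts").build()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "doc1");
                    doc.addField("body", "updated text");
                    // Must equal the currently stored _version_, otherwise Solr rejects the update.
                    doc.addField("_version_", 1234567890123456789L);
                    try {
                        client.add(doc);
                        client.commit();
                    } catch (SolrException conflict) {
                        // HTTP 409: someone else updated the doc first; re-read it, reapply the change, retry.
                    }
                }
            }
        }
        {code}

        This protects individual updates against lost writes; whether it also rules out the out-of-order-version scenario from SOLR-8709 is exactly the question above.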

        Joel Bernstein added a comment (edited) -

        Hi,

        I also saw your post on the reference guide. Let's discuss this a little bit on this ticket and then we can move to users list to continue the discussion.

        1) About SOLR-8709. The issue here is that it's possible that an out of order version number could persist across commits. This would cause a topic to miss documents. But I've tested the topic in many different scenarios and have never been able to make it happen. In all my testing I've never once seen the topic() function fail to retrieve all documents from the topic. Also, Solr is not a transactional system, so data loss in general is possible. So I'm not sure the chance of data loss in this scenario is any worse than the chance of data loss in Solr in general.

        2) In Solr 6.3 we now have an executor:

        https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-executor

        This allows you to shuffle topics to worker nodes and execute them in parallel. This should scale fairly well.
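        To make that concrete, here is a hedged sketch of the pattern described for the executor: stored alert expressions live as documents in a work-queue collection (the expression text in an expr_s field), a topic() streams the ones not yet processed, executor() runs each of them in a thread pool, and daemon() keeps the whole thing iterating. The collection names, the expr_s field, and the exact parameter spellings below follow my reading of the 6.x reference guide and may need adjusting:

        {code:java}
        import org.apache.solr.client.solrj.io.stream.SolrStream;
        import org.apache.solr.common.params.ModifiableSolrParams;

        public class RegisterExecutorDaemon {
            public static void main(String[] args) throws Exception {
                String expr =
                    "daemon(id=\"alertExecutor\", runInterval=\"10\", terminate=\"true\", " +
                    "  executor(threads=10, " +
                    "    topic(checkpoints, workQueue, q=\"*:*\", fl=\"id,expr_s\", " +
                    "          initialCheckpoint=0, id=\"executorTopic\")))";

                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("expr", expr);
                params.set("qt", "/stream");

                // Sending a daemon expression to the /stream handler registers it on that node.
                SolrStream stream = new SolrStream("http://localhost:8983/solr/workQueue", params);
                try {
                    stream.open();
                    while (!stream.read().EOF) {
                        // drain the handler's acknowledgement tuples
                    }
                } finally {
                    stream.close();
                }
            }
        }
        {code}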

        Hemant Purswani added a comment -

        Seems like the topic() function is still in beta (https://issues.apache.org/jira/browse/SOLR-8709).

        Hemant Purswani added a comment -

        Is there a working example of using the topic() function for alerts? Is the streaming API robust enough to be used in production?

        Joel Bernstein added a comment -

        We also have the topic() function, which stores its checkpoints for a topic in a SolrCloud collection. Topics currently just store the checkpoints, but we could have them store the query as well. This would satisfy the stored-query feature.

        Then you could shuffle stored queries off to worker nodes to be executed in parallel. If you need to scale up you just add more workers and replicas.
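        For anyone following along, here is a minimal SolrJ sketch of pulling a topic's new matches. The collection names are examples: topic() persists its checkpoints in the first collection ("checkpoints") and runs the stored query against the second ("alerts"); each run returns only documents added since the previous checkpoint:

        {code:java}
        import org.apache.solr.client.solrj.io.Tuple;
        import org.apache.solr.client.solrj.io.stream.SolrStream;
        import org.apache.solr.common.params.ModifiableSolrParams;

        public class TopicPoll {
            public static void main(String[] args) throws Exception {
                String expr = "topic(checkpoints, alerts, id=\"alertTopic1\", " +
                              "q=\"body:solr\", fl=\"id,body\", initialCheckpoint=0)";

                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("expr", expr);
                params.set("qt", "/stream");

                SolrStream stream = new SolrStream("http://localhost:8983/solr/alerts", params);
                try {
                    stream.open();
                    Tuple tuple;
                    while (!(tuple = stream.read()).EOF) {
                        System.out.println("new match since last checkpoint: " + tuple.getString("id"));
                    }
                } finally {
                    stream.close();
                }
            }
        }
        {code}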

        Jan Høydahl added a comment -

        Lots of things have happened in the last 18 months... We got streaming expressions, which could perhaps be a way for clients to consume the stream of matches in an asynchronous fashion? And we could create a configset for alerting which keeps all the wiring in one place... Joel Bernstein do you think that the daemon() stuff from streaming could be suitable as an API for consuming alerts in this context?
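        A hedged sketch of what that could look like, combining daemon(), topic() and the update() decorator: a daemon registered on a node repeatedly runs a topic over the alert query and pushes any new matches into a destination collection that clients consume. All collection names and the query are illustrative, and the expression would be registered with the /stream handler the same way as in the executor sketch above:

        {code:java}
        // Registered via the /stream handler (expr parameter), e.g. with SolrStream as shown earlier.
        String alertDaemon =
            "daemon(id=\"alertDaemon1\", runInterval=\"10000\", terminate=\"false\", " +
            "  update(alertMatches, batchSize=10, " +
            "    topic(checkpoints, alerts, id=\"alertTopic1\", " +
            "          q=\"body:(solr AND alerting)\", fl=\"id,body\", initialCheckpoint=0)))";
        {code}

        Clients would then poll (or subscribe to) the alertMatches collection rather than re-running the query themselves.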

        Jan Høydahl added a comment -

        Yeah, we need the REST APIs anyway, so the best thing, as Mark says, is to start fleshing out the APIs; it is always easier to comment on concrete proposals.

        Mark Miller added a comment -

        I think you have the right idea. Just ignore any pushback and get to work. People who put up code decide, and I've seen your work; we will be lucky to have you putting up code for whatever you want improved or worked on.

        Steve Davids added a comment -

        I believe we are confusing what Luwak is: Luwak is just an optimized matching algorithm, which really belongs in the Lucene package rather than the Solr package. Since this ticket is centered around Solr's implementation of the "percolator", it has more to do with the registration of queries and providing an API to stream back to the client the ids of the saved-search queries that matched a particular document. From a black-box perspective that external interface (the Solr HTTP API) should be rather simple, though the internal workings could be marked as experimental and swapped out for better implementations in the future.

        Jack Krupansky added a comment -

        as long as the API remains the same

        -1

        Just go with a contrib module ASAP, like even today's Luwak in 5.0, let people get experience with an "experimental" API, and then debate what the "final", non-contrib API should be. Or maybe there is real benefit in multiple modules with somewhat distinct APIs for different use cases; no need to presume that a one-size-fits-all API is necessarily best here.

        Shalin Shekhar Mangar added a comment -

        Does it make sense to others to start with an initial approach then provide optimizations in future releases just as long as the API remains the same?

        +1

        Mark Miller added a comment -

        +1

        Steve Davids added a comment -

        I agree that the Luwak approach provides clever performance optimizations by removing unnecessary queries upfront. However, Luwak doesn't really solve providing "percolator-like functionality"; it just provides an optimized matching algorithm. There is a decent amount of work here to allow clients to register queries in a Solr cluster and provide an API to pass a document and have it matched against registered queries in a distributed manner, none of which is handled by Luwak. I personally believe this ticket can be implemented without Luwak's optimizations and still provide value. We could provide a usage caveat that you might not want to register more than 20k queries per shard or so; if users want to register more queries, they can shard out their profiling/matcher collection to take advantage of additional hardware. We can provide an initial implementation and then optimize the matching once the Luwak dependencies are completed; from an outside-in perspective the API would remain the same, matching would just become faster at a future point.

        Does it make sense to others to start with an initial approach then provide optimizations in future releases just as long as the API remains the same?

        Otis Gospodnetic added a comment -

        I believe that much of the work in Luwak also comes from the realization that the number of queries must be reduced prior to looping

        That's correct. In our work with Luwak this is the key. You can have 1M queries, but if you really need to run incoming documents against all 1M queries expect to have VERY low throughput and VERY HIGH match latencies. We are working with 1-2M queries and reducing those to a few thousand queries with Luwak's Presearcher, and still have latencies of a few hundred milliseconds.
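        For context, here is a rough sketch of the Luwak usage being described, based on the flaxsearch/luwak README of that era (package and method names are from memory and may differ between Luwak versions). Queries are registered with a Monitor; its Presearcher indexes their terms so that, for each incoming document, only the stored queries that could possibly match are actually executed:

        {code:java}
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import uk.co.flax.luwak.InputDocument;
        import uk.co.flax.luwak.Matches;
        import uk.co.flax.luwak.Monitor;
        import uk.co.flax.luwak.MonitorQuery;
        import uk.co.flax.luwak.QueryMatch;
        import uk.co.flax.luwak.matchers.SimpleMatcher;
        import uk.co.flax.luwak.presearcher.TermFilteredPresearcher;
        import uk.co.flax.luwak.queryparsers.LuceneQueryParser;

        public class LuwakSketch {
            public static void main(String[] args) throws Exception {
                // The presearcher is what cuts the 1-2M registered queries down to a few thousand candidates.
                Monitor monitor = new Monitor(new LuceneQueryParser("text"), new TermFilteredPresearcher());
                monitor.update(new MonitorQuery("alert-1", "solr AND alerting"));

                InputDocument doc = InputDocument.builder("doc-1")
                        .addField("text", "new article about solr alerting", new StandardAnalyzer())
                        .build();

                // Only the candidate queries the presearcher could not rule out are executed here;
                // the Matches object carries the ids of the stored queries that matched doc-1.
                Matches<QueryMatch> matches = monitor.match(doc, SimpleMatcher.FACTORY);
            }
        }
        {code}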

        Jan Høydahl added a comment -

        If Alan Woodward has a vision for how to take Luwak forward (perhaps integrate it as a Lucene module?), why don't we help out on the missing parts and make intervals happen for 5.0 instead of reinventing things? Luwak seems very well engineered and targets the needs of the most demanding users, so why aim for anything less? Perhaps intervals could be the main selling point of Lucene 5.0 and alerting the main new feature for Solr 6.0?

        Fredrik Rodland added a comment (edited) -

        Sounds good!

        Having implemented a pretty large system for matching documents against queries (using Elasticsearch to index the queries), we discovered very early that filtering the queries was an important requirement to get things running with acceptable performance.

        So I would add to your list of acceptance criteria that the request must support fq and that this is applied prior to the looping. This would give us a smaller list of queries to loop over and thus reduce the time to complete the request. For this to work, queries also need to support filter fields, i.e. regular Solr fields in addition to the fq, q, defType, etc. mentioned above.

        For the record, our system has ≈1 million queries, and we're matching ≈10 docs/s. I believe that much of the work in Luwak also comes from the realization that the number of queries must be reduced prior to looping. I'm sure Alan Woodward can elaborate on this as well.

        Steve Davids added a comment -

        I don't think Luwak is really an implementation of this particular feature. It does perform percolating functionality, but as a stand-alone library that isn't integrated into Solr. May I suggest that we take a stab at this without waiting around for Luwak, since that implementation depends on LUCENE-2878, which seems to keep stalling over and over again. The initial approach can take the naive loop across all queries for each document request, and at a later point the Luwak approach can be incorporated to provide some nice optimizations. Here are some initial thoughts on acceptance criteria / what can be done to incorporate this functionality into Solr (a purely hypothetical request/response sketch follows the list):

        1. Able to register a query within a separate Solr core
          • Should take advantage of Solr's sharding ability in Solr Cloud
          • This can piggy-back off of the standard SolrInputDocument semantics with adding/deleting to perform query registration/deregistration.
          • Schema would define various fields for the stored query: q, fq, defType, etc.
        2. Able to specify which query parser should be used when matching docs (persisted w/ query)
        3. Able to specify the other core that the document should be profiled against (this can be at request time if you would like to profile against multiple shards)
          • Allows the profiling to know the fields, analysis chain, etc
        4. Should allow queries to be cached in RAM so they don't need to be re-parsed continually
        5. Custom response handler (perhaps a subclass of the search handler) should make a distributed request to all shards to gather all matching query profile ids and return to the client.

        This is one of those features that would provide a lot of value to users, and it would be fantastic if we could get it incorporated sooner rather than later.
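        To make the criteria above concrete, here is a purely hypothetical sketch; the "queries" core and its fields, the /percolate handler, and the matchingQueryIds response key are all invented for illustration and do not exist in Solr today. Registration piggy-backs on normal document indexing (criteria 1-2), and matching posts a document's fields to a custom handler that fans out to all shards and returns the ids of the stored queries that matched (criterion 5):

        {code:java}
        import org.apache.solr.client.solrj.impl.HttpSolrClient;
        import org.apache.solr.client.solrj.response.QueryResponse;
        import org.apache.solr.common.SolrInputDocument;
        import org.apache.solr.common.params.ModifiableSolrParams;

        public class HypotheticalPercolatorApi {
            public static void main(String[] args) throws Exception {
                try (HttpSolrClient queries =
                         new HttpSolrClient.Builder("http://localhost:8983/solr/queries").build()) {

                    // 1) Register a saved search as an ordinary document in the "queries" core.
                    SolrInputDocument storedQuery = new SolrInputDocument();
                    storedQuery.addField("id", "alert-42");
                    storedQuery.addField("q", "title:(solr AND percolator)");
                    storedQuery.addField("fq", "lang:en");        // filter field, lets the matcher pre-filter stored queries
                    storedQuery.addField("defType", "edismax");
                    queries.add(storedQuery);
                    queries.commit();

                    // 2) Match a document against the registered queries (hypothetical handler and params).
                    ModifiableSolrParams p = new ModifiableSolrParams();
                    p.set("qt", "/percolate");                     // does not exist today
                    p.set("doc.title", "New Solr percolator proposal");
                    p.set("doc.lang", "en");
                    QueryResponse rsp = queries.query(p);
                    System.out.println(rsp.getResponse().get("matchingQueryIds"));  // hypothetical response key
                }
            }
        }
        {code}

        The matcher behind such a handler could start as the naive MemoryIndex loop from the issue description and later be swapped for a Luwak-style presearcher without changing the external API.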

        Otis Gospodnetic added a comment -

        Correct. We can leave this open for now.

        Alexandre Rafalovitch added a comment -

        I guess Luwak https://github.com/flaxsearch/luwak is a related project here.


          People

          • Assignee: Unassigned
          • Reporter: Otis Gospodnetic
          • Votes: 21
          • Watchers: 48