Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Leases/reservations on searcher instances would give us the ability to use the same searcher across phases of a distributed search, or for clients to send multiple requests and have them hit a consistent/unchanging view of the index. The latter requires something extra to ensure that the load balancer contacts the same replicas as before.

        Issue Links

          Activity

          Andrzej Bialecki added a comment -

          Multiple leases could lead to searchers piling up ... perhaps it's less expensive to pass around searcher versions and fail+retry distributed requests if the reported version from a node disagrees with the version obtained during earlier phases? I guess it depends on the rate of change, and consequently the likelihood of failure.

          Robert Muir added a comment -

          but fail+retry could lead to slow queries piling up?

          Yonik Seeley added a comment -

          Multiple leases could lead to searchers piling up

          It's not so much the number, but how long.
          A good default would be perhaps 1 sec... long enough for most individual requests to complete, not long enough to cause massive pile-up.

          ... perhaps it's less expensive to pass around searcher versions and fail+retry distributed requests if the reported version from a node disagrees with the version obtained during earlier phases?

          That's a good idea, and it would have been a nice, simple way to do it before NRT appeared on the scene.

          I guess it depends on the rate of change, and consequently the likelihood of failure.

          Right. It would be nice to let the user select the trade-off. For example, with leases the user could specify an expensive request (one that is hopefully run infrequently) that absolutely needs consistency (and fails if it can't be achieved). The user could specify a longer lease for this request (say 60 sec) to ensure that it can normally complete. This is much nicer than saying "requests longer than the NRT commit frequency cannot be guaranteed to be correct".

          Jason Rutherglen added a comment -

          In RT the searchers are cheap. The easiest approach would be to record the segments and max doc ids used to satisfy phase 1 of a given distributed query, then send that signature back in subsequent phases. A retry would only be necessary in the infrequent case of a merge having occurred.

          With NRT it's probably best to implement a searcher policy similar to index deletion policy. Then any timeout / searcher removal system can be implemented by the user, rather than dictated by Solr.

          The described searcher management system belongs in a module in Lucene rather than Solr, probably in one of Mike's new classes.
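          A rough, illustrative sketch of the segment/maxDoc "signature" approach described above, assuming a modern Lucene API (IndexReader.leaves(), docBase, maxDoc()); the class and method names are hypothetical and not part of any actual patch:

          import java.util.ArrayList;
          import java.util.List;

          import org.apache.lucene.index.LeafReaderContext;
          import org.apache.lucene.search.IndexSearcher;

          // Hypothetical helper: captures a per-segment "signature" (each leaf's docBase
          // and maxDoc) for the searcher used in phase 1, so later phases can verify they
          // are still looking at the same view of the index and retry otherwise.
          final class SearcherSignature {

            // Capture the signature of the view used to answer phase 1.
            static List<String> capture(IndexSearcher searcher) {
              List<String> sig = new ArrayList<>();
              for (LeafReaderContext leaf : searcher.getIndexReader().leaves()) {
                sig.add(leaf.docBase + ":" + leaf.reader().maxDoc());
              }
              return sig;
            }

            // Later phases compare their view against the signature sent back by the coordinator.
            static boolean matches(IndexSearcher searcher, List<String> phase1Signature) {
              return capture(searcher).equals(phase1Signature);
            }
          }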

          Yonik Seeley added a comment -

          Distributed search already ensures that different phases of the same request go to the same replicas, so the remainder of this patch should be pretty easy.

          We just need a SolrQueryRequest.getSearcher() that checks the request parameters. If there is a reserve=1000 parameter, then the searcher's reference count is incremented an additional time and the searcher is added to a map / priority queue (or whatever a separate timing thread looks at to decrement the ref after the lease is up), or its entry is updated if it already exists. A unique id for the searcher is returned in the request header.

          If the request params have "searcher=1294929223" then getSearcher() retrieves that version from the map if the lease hasn't expired (and applies a new lease if there is also a reserve param).

          This seems nice and isolated from the rest of the Solr code... no need to modify any of the guts of the complex SolrCore.getSearcher().
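
          A hypothetical sketch of the reserve/searcher parameter handling described above; the class, the id scheme, and the expiry mechanism are assumptions for illustration, not an actual patch:

          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.atomic.AtomicLong;

          import org.apache.solr.common.SolrException;
          import org.apache.solr.common.params.SolrParams;
          import org.apache.solr.request.SolrQueryRequest;
          import org.apache.solr.search.SolrIndexSearcher;
          import org.apache.solr.util.RefCounted;

          // Hypothetical lease bookkeeping, kept outside of SolrCore.getSearcher().
          class SearcherLeases {
            private final ConcurrentHashMap<String, RefCounted<SolrIndexSearcher>> leased = new ConcurrentHashMap<>();
            private final ScheduledExecutorService expirer = Executors.newSingleThreadScheduledExecutor();
            private final AtomicLong ids = new AtomicLong();  // hypothetical unique-id scheme

            RefCounted<SolrIndexSearcher> getSearcher(SolrQueryRequest req) {
              SolrParams params = req.getParams();
              String wanted = params.get("searcher");                      // e.g. searcher=1294929223
              long reserveMs = Long.parseLong(params.get("reserve", "0")); // e.g. reserve=1000

              RefCounted<SolrIndexSearcher> holder;
              if (wanted != null) {
                holder = leased.get(wanted);  // previously leased view
                if (holder == null) {
                  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                      "searcher " + wanted + " is no longer reserved");
                }
              } else {
                holder = req.getCore().getSearcher();  // current live searcher
              }

              if (reserveMs > 0) {
                holder.incref();  // extra reference held for the duration of the lease
                final String id = (wanted != null) ? wanted : Long.toString(ids.incrementAndGet());
                final RefCounted<SolrIndexSearcher> leasedHolder = holder;
                leased.put(id, holder);
                // A real implementation would extend an existing lease instead of stacking decrefs.
                expirer.schedule(() -> { leased.remove(id); leasedHolder.decref(); },
                    reserveMs, TimeUnit.MILLISECONDS);
                req.getContext().put("reservedSearcherId", id);  // report the id back to the caller
              }
              return holder;
            }
          }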

          Robert Muir added a comment -

          The retry would only be necessary in the infrequent case of a merge have occurred.

          I don't know that this is always true: if we support distributed stats (idf etc.), then this assumption could break down in a lot of cases, because the stats from the first phase don't make sense with regard to the documents being scored in the second phase.

          With Lucene's simple tf/idf there are probably fewer concerns, but if we are going to go this route we should consider how it will affect other scoring algorithms, especially ones that use 'global stats' for things like length normalization (BM25's avg dl, DFR, etc.) versus just query-term importance.

          Jason Rutherglen added a comment -

          if we support distributed stats (idf etc.), then this assumption could break down in a lot of cases, because the stats from the first phase don't make sense with regard to the documents being scored in the second phase.

          They would make sense. Though it's like discussing the wind, since RT isn't completed. It sounds like you're thinking of a general retry? That's a good idea; it would need to retry the entire distributed query in all phases.

          That functionality should be modular and should not be 'pre-baked / canned' into Solr. [Again] a simple policy class would suffice here.

          Robert Muir added a comment -

          Sounds like you're thinking of a general retry?

          I have no idea really; I was just bringing up the issue. We can always use algorithms like the ones you describe, and I think we should cheat as much as possible for good performance (whatever that approach is), as long as we come up with tests to ensure it doesn't totally break down scoring (I'm not talking about relevance, but I don't want scorers returning NaN/Inf if they encounter a tf value > the totalTermFreq stat from a previous phase).
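
          A toy, self-contained illustration of the kind of breakdown being described, using the classic BM25 idf and docFreq vs. numDocs (rather than tf vs. totalTermFreq); the numbers are made up:

          // Toy example, not Lucene code: the classic BM25 idf assumes docFreq <= numDocs.
          // If numDocs was captured in phase 1 and docFreq comes from a newer view in phase 2,
          // the argument of the log can go negative and the score becomes NaN.
          public class StaleStatsDemo {
            public static void main(String[] args) {
              long numDocs = 100;  // collection size reported during phase 1
              long docFreq = 103;  // docFreq for the term observed in phase 2, after new docs arrived
              double idf = Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
              System.out.println(idf);  // prints NaN
            }
          }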

          Jason Rutherglen added a comment -

          no need to modify any of the guts of the complex SolrCore.getSearcher()

          That code should be removed entirely, preferably now in 4.x.

          The unique id per searcher idea would work; however, it also needs to implement a retry for when a given id no longer exists.

          Still, this would be best implemented in the context of rewriting distributed search and the getSearcher code. Otherwise we are layering hacked-up code on top of further hacked-up code, which is a mess to debug and change later.

          Yonik Seeley added a comment -

          We just need a SolrQueryRequest.getSearcher() that checks the request parameters. If there is a reserve [...]

          Actually, since a single local request already holds the searcher open for as long as it takes the request to complete, more useful semantics might be for "reserve" to mean additional time after the request has completed (i.e. in SolrQueryRequest.close()).

          Jason Rutherglen added a comment -

          SOLR-2778 is the issue that seeks to clean up and modularize the distributed search code.

          Michael McCandless added a comment -

          I've been working on a similar idea: a standalone utility class that records searchers keyed by their IR.getVersion().

          I think this is important not only for consistency within a single distributed search request, but also across requests when a user does a follow-on action (next page, drill down/up, etc.), especially now that we have searchAfter where the "last bottom" can easily shift.

          I'll open a new issue; I'm not sure we can combine the approaches but maybe...

          Yonik Seeley added a comment -

          I'm not sure we can combine the approaches but maybe...

          Yeah, Solr has had its own searcher management since '04... I'm not sure it's practical to try and replace it all, especially when it seems so easy to build leases on top of it and not touch it at all.
          One thing that comes to mind that we could share is a general-purpose class LeaseManager<Key,LeasedResource>
          that simply has a few methods like:

          long reserve(LeasedResource resource, long ms)
          LeasedResource lookup(Key key)
          void clearAll()
          

          Instead of LeasedResource, perhaps it could just be LeaseManager<Key, Closeable>() as I think close() is the only thing that a LeaseManager would need to call?
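
          A hedged sketch of what such a general-purpose LeaseManager could look like; the explicit key parameter on reserve(), the lazy expiry check, and closing resources in clearAll() are assumptions for illustration:

          import java.io.Closeable;
          import java.io.IOException;
          import java.util.Iterator;
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          // Generic lease manager: close() is the only thing it ever calls on a resource.
          class LeaseManager<K, V extends Closeable> {
            private static final class Lease<T> {
              final T resource;
              volatile long expiresAt;
              Lease(T resource, long expiresAt) { this.resource = resource; this.expiresAt = expiresAt; }
            }

            private final Map<K, Lease<V>> leases = new ConcurrentHashMap<>();

            // Reserve (or extend) a lease on the resource for ms milliseconds; returns the expiry time.
            long reserve(K key, V resource, long ms) {
              long expiresAt = System.currentTimeMillis() + ms;
              Lease<V> lease = leases.merge(key, new Lease<>(resource, expiresAt), (old, fresh) -> {
                old.expiresAt = Math.max(old.expiresAt, fresh.expiresAt);  // leases only ever get longer
                return old;
              });
              return lease.expiresAt;
            }

            // Return the leased resource, or null if the lease expired or never existed.
            V lookup(K key) {
              Lease<V> lease = leases.get(key);
              return (lease != null && lease.expiresAt > System.currentTimeMillis()) ? lease.resource : null;
            }

            // Drop all leases, closing each resource.
            void clearAll() throws IOException {
              for (Iterator<Lease<V>> it = leases.values().iterator(); it.hasNext(); ) {
                Lease<V> lease = it.next();
                it.remove();
                lease.resource.close();
              }
            }
          }

          Solr could then wrap its RefCounted<SolrIndexSearcher> in a Closeable whose close() simply calls decref(), so the lease layer would never need to know anything about Solr's searcher management.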

          Michael McCandless added a comment -

          OK I opened LUCENE-3486 w/ my current patch.


            People

            • Assignee: Unassigned
            • Reporter: Yonik Seeley
            • Votes: 2
            • Watchers: 4
