Solr / SOLR-8906

Make transient core cache pluggable.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.6, 7.0
    • Component/s: None
    • Labels: None

      Description

      The current Lazy Core stuff is pretty deeply intertwined in CoreContainer. Adding and removing active cores is based on a simple LRU mechanism, but keeping the right cores in the right internal structures involves a lot of attention to locking various objects to update internal structures. This makes it difficult/dangerous to use any other caching algorithms.
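For readers unfamiliar with the "simple LRU mechanism" mentioned above, here is a minimal sketch of an LRU cache built on `java.util.LinkedHashMap` in access order. The class names `LruCoreCache` and `LruCoreCacheDemo` are hypothetical and this is not the actual CoreContainer code; it only illustrates the eviction behavior, without any of the locking the description refers to.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual CoreContainer code.
class LruCoreCache<V> extends LinkedHashMap<String, V> {
    private final int maxSize;

    LruCoreCache(int maxSize) {
        // accessOrder=true: iteration runs from least to most recently accessed.
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        // Invoked by put(); returning true drops the least recently used entry.
        // A real container would also have to close the evicted core here.
        return size() > maxSize;
    }
}

class LruCoreCacheDemo {
    public static void main(String[] args) {
        LruCoreCache<String> cache = new LruCoreCache<>(2);
        cache.put("core1", "A");
        cache.put("core2", "B");
        cache.get("core1");      // touch core1, so core2 becomes the eldest
        cache.put("core3", "C"); // exceeds maxSize, so core2 is evicted
        System.out.println(cache.keySet()); // prints [core1, core3]
    }
}
```

The map itself is not thread-safe, which is one reason the surrounding code has to pay so much attention to locking.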

      Any single age-out algorithm will be non-optimal for some access patterns, so making this pluggable would allow better algorithms to be substituted in those cases.

      If we ever extend transient cores to SolrCloud, we need to have load/unload decisions that are cloud-aware rather than entirely local, so in that sense this would lay some groundwork if we ever want to go there.

      So I'm going to try to hack together a PoC. Any ideas on the most sensible pattern for this gratefully received.

      1. SOLR-8906.patch
        53 kB
        Erick Erickson
      2. SOLR-8906.patch
        64 kB
        Erick Erickson
      3. SOLR-8906.patch
        53 kB
        Erick Erickson
      4. SOLR-8906.patch
        44 kB
        Erick Erickson
      5. SOLR-8906.patch
        26 kB
        Erick Erickson

        Activity

        ben.manes Ben Manes added a comment -

        Not sure if it helps, but there's discussion of using TinyLFU instead of LRU / LFU for the SolrCache (SOLR-8241). That library could be used instead of LRU here too to evict based on recency and frequency. From my reading of transientCores that appears to be a simple migration.

        erickerickson Erick Erickson added a comment -

        Thanks, I'll take a look. So far I've hacked an "interface" that is just a subset of LinkedHashMap to see if it could be made pluggable. Creating an abstraction that's more generalized is high on the priority list, and working with better cache implementations may make sense.

        The biggest gain would be from not unloading cores that were likely to be used again. The simple LRU implementation suffers from access patterns that periodically touch each entry exactly once (say for a health check or some such). Having the frequency of access incorporated into the eviction policy would be A Good Thing.

        That said, I haven't looked at all at the implementation of TinyLFU, and probably won't get to it for a week or two so don't think I'm totally dropping the ball in the meanwhile. Thanks for bringing it to my attention!
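To make the "health check" problem concrete, here is a small sketch of a frequency-aware cache. It is a deliberately naive least-frequently-used policy and is NOT TinyLFU; the class name `FrequencyCoreCache` and its methods are hypothetical. With a plain LRU of the same size, a scan that touches several cold cores once each would push out a core that is used constantly; counting accesses lets the busy core survive.

```java
import java.util.HashMap;
import java.util.Map;

// Naive frequency-aware cache (hypothetical): evicts the entry with the
// fewest recorded accesses. A simple stand-in, not TinyLFU.
class FrequencyCoreCache<V> {
    private final int maxSize;
    private final Map<String, V> values = new HashMap<>();
    private final Map<String, Integer> hits = new HashMap<>();

    FrequencyCoreCache(int maxSize) {
        this.maxSize = maxSize;
    }

    V get(String name) {
        V v = values.get(name);
        if (v != null) {
            hits.merge(name, 1, Integer::sum); // count every successful access
        }
        return v;
    }

    void put(String name, V core) {
        if (!values.containsKey(name) && values.size() >= maxSize) {
            evictLeastFrequent();
        }
        values.put(name, core);
        hits.merge(name, 1, Integer::sum);
    }

    boolean contains(String name) {
        return values.containsKey(name);
    }

    private void evictLeastFrequent() {
        String victim = null;
        int min = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() < min) {
                min = e.getValue();
                victim = e.getKey();
            }
        }
        values.remove(victim);
        hits.remove(victim);
    }
}

class FrequencyCoreCacheDemo {
    public static void main(String[] args) {
        FrequencyCoreCache<String> cache = new FrequencyCoreCache<>(2);
        cache.put("hot", "H");
        for (int i = 0; i < 5; i++) {
            cache.get("hot"); // a core that is in constant use
        }
        // A health-check style scan touching each cold core exactly once.
        cache.put("cold1", "1");
        cache.put("cold2", "2");
        cache.put("cold3", "3");
        System.out.println(cache.contains("hot")); // prints true
    }
}
```

A pure frequency count never ages, so a core that was once busy can linger indefinitely; policies such as TinyLFU combine recency and frequency to avoid that, which this sketch does not attempt.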

        ben.manes Ben Manes added a comment -

        TinyLFU is scan resistant (see Glimpse trace). For implementation details a nice overview is provided in the HighScalability article.

        erickerickson Erick Erickson added a comment -

        So I did hack together a PoC and it doesn't disrupt CoreContainer too much. I'm not really ready to put it up since it's too crude. The "interface" is just selected interfaces from LinkedHashMap for instance, but it works enough to decouple internal locking of objects in CoreContainer from the plugin code which was my first concern.

        Thinking about this some more, I started asking myself why should only transient cores be manipulated by the plugin? CoreContainer shouldn't really need to care whether the core is transient or not for its purposes. Gotta think about that some more. Once the state of the core is removed from being so intertwined with CoreContainer, it seems like it would be adaptable to using ZK as "the one source of truth" pretty easily...
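The decoupling described in this comment might look roughly like the sketch below: the plugin exposes a small, Map-like subset of operations and does no locking of its own, while the container owns the lock. All names here (`TransientCoreCachePlugin`, `LruPlugin`, `ContainerSketch`) are hypothetical and are not taken from the patch; this only illustrates the shape of the idea.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical plugin contract: a small subset of Map-style operations.
interface TransientCoreCachePlugin<V> {
    V get(String name);
    V put(String name, V core);
    V remove(String name);
    Set<String> names();
}

// Default policy: plain LRU, deliberately not thread-safe.
class LruPlugin<V> implements TransientCoreCachePlugin<V> {
    private final LinkedHashMap<String, V> map;

    LruPlugin(final int maxSize) {
        this.map = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxSize;
            }
        };
    }

    public V get(String name) { return map.get(name); }
    public V put(String name, V core) { return map.put(name, core); }
    public V remove(String name) { return map.remove(name); }
    public Set<String> names() { return new LinkedHashSet<>(map.keySet()); }
}

// Stand-in for the container: it owns the lock, the plugin owns the policy.
class ContainerSketch<V> {
    private final Object lock = new Object();
    private final TransientCoreCachePlugin<V> plugin;

    ContainerSketch(TransientCoreCachePlugin<V> plugin) {
        this.plugin = plugin;
    }

    V getCore(String name) {
        synchronized (lock) { return plugin.get(name); }
    }

    void addCore(String name, V core) {
        synchronized (lock) { plugin.put(name, core); }
    }

    Set<String> coreNames() {
        synchronized (lock) { return plugin.names(); }
    }
}
```

Swapping in a different eviction policy then means passing another `TransientCoreCachePlugin` implementation to `ContainerSketch`, with no change to how the container locks.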

        noble.paul Noble Paul added a comment -

        Lazy cores are themselves a vestige of the old master-slave model, where cores were not expected to be up.

        So, this is an X-Y problem. Let's ask the question: why do we want to unload a core?

        We just need to ensure that the resources held by a core are kept minimal. The expensive resources are file handles and caches (there could be others, but we can ignore them for now). So, if we manage to free up these resources for an unused core, we can pretty much achieve our objective.
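One way to picture this alternative is a core handle that stays registered but opens its expensive resources on demand and can drop them when idle. The sketch below assumes a hypothetical `Heavy` resource bundle and a hypothetical `ReleasableCore` wrapper; neither is an existing Solr class, and real code would also need to track in-flight requests before releasing anything.

```java
import java.util.function.Supplier;

// Hypothetical bundle of expensive resources (file handles, caches, ...).
interface Heavy extends AutoCloseable {
    @Override
    void close(); // narrowed so callers need not handle checked exceptions
}

// Hypothetical core handle: always registered, heavy parts opened lazily.
class ReleasableCore {
    private final Supplier<Heavy> opener;
    private Heavy heavy; // null while the resources are released

    ReleasableCore(Supplier<Heavy> opener) {
        this.opener = opener;
    }

    // Opens the heavy resources on first use, or after a release.
    synchronized Heavy acquire() {
        if (heavy == null) {
            heavy = opener.get();
        }
        return heavy;
    }

    // Frees the heavy resources; the core itself stays known to the container.
    synchronized void release() {
        if (heavy != null) {
            heavy.close();
            heavy = null;
        }
    }

    synchronized boolean isOpen() {
        return heavy != null;
    }
}
```

Deciding which idle cores to call `release()` on is still an eviction decision, so a bookkeeping structure much like the transient core cache remains necessary under this design.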

        erickerickson Erick Erickson added a comment -

        bq: why do we want to unload a core?

        Well, I know of at least one situation where people have implemented auto-scaling that can do things like:
        > split an index in the background. More generally maintain indexes of size no larger than X. Then move one or more of the splits "someplace else".
        > move a core's index to SSD while it's hot.
        > re-partition user's data based on some heuristics without downtime.
        > any situation where a core is manipulated outside Solr and still needs to service requests in the meantime.

        So I do wonder if we can stand the question on its head. Rather than think of it as the transient cores being an afterthought, what if we move all core management to a plugin? NOTE: this is really fuzzy ATM, just askin'.

        Then we wouldn't have the distinction in Solr of "transient", "lazy loading", and "regular" deeply embedded in CoreContainer and related code. Even in the case where we open/close the heavyweight objects rather than load/unload cores, we still have to maintain lists of what cores have searchers already open and the like, similar to what happens in transient cores. Does it make any sense to think of moving all core management to a (suitably modified) transient core plugin? Then the default implementation we provide would just manage the heavyweight objects rather than load/unload cores and others could do as they wished.

        Going forward, when everything is SolrCloud, there would be a degenerate case of leader-only collections that could essentially be treated as we do the current standalone code I'd guess.

        bq: lazy cores itself is a vestige of the old master-slave model

        Not at all sure I agree. Even when SolrCloud rules the world, there'll always be edge cases where some organization pushes the limit. I don't want to keep Solr from evolving just to accommodate these edge cases, but I also don't want to prematurely decide for them that "we can do it better". 'cause we can't in situation N+1.

        Oh, and let's keep a distinction between "lazy" and "transient" cores. "lazy" just means it isn't loaded until it's called for, it can be permanent after that. "transient" is the whole cache-and-load/unload-when-needed bit. Don't quite know how those will reconcile going forward, but the idea of opening/closing heavyweight objects is still "lazy" cores in some sense.

        erickerickson Erick Erickson added a comment -

        Here's a patch that moves all the transient core processing to a plugin. If nothing else, changing the code to this model will move us toward a place where we can isolate the core handling to a plugin...

        I won't commit this for a while, certainly not until after 6.5 is cut. Between now and then we can debate.

        erickerickson Erick Erickson added a comment -

        Oops, this one has the newly-added files.

        erickerickson Erick Erickson added a comment -

        I think this is close to ready, I'll probably commit it this week sometime.

        erickerickson Erick Erickson added a comment -

        As he slowly learns how to use git to create a proper patch that incorporates more than one local commit....

        erickerickson Erick Erickson added a comment -

        Final patch with CHANGES.txt.

        jira-bot ASF subversion and git services added a comment -

        Commit 52632cfc0c0c945cff2e769e6c2dc4dc9a5da400 in lucene-solr's branch refs/heads/master from Erick Erickson
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=52632cf ]

        SOLR-8906: Make transient core cache pluggable

        jira-bot ASF subversion and git services added a comment -

        Commit 2ca7e7ec490d4c891b27d61cbf8696d4c4dc6953 in lucene-solr's branch refs/heads/branch_6x from Erick Erickson
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2ca7e7e ]

        SOLR-8906: Make transient core cache pluggable

        (cherry picked from commit 52632cf)


          People

          • Assignee: Erick Erickson
          • Reporter: Erick Erickson
          • Votes: 0
          • Watchers: 6
