SOLR-5795

Option to periodically delete docs based on an expiration field -- or ttl specified when indexed.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      A question I get periodically from people is how to automatically remove documents from a collection at a certain time (or after a certain amount of time).

      Excluding expired documents from search results using a filter query on a date field is trivial, but you still have to periodically send a deleteByQuery to clean up those older "expired" documents. And in the case where you want all documents to auto-expire some fixed amount of time after they were indexed, you still have to set up a simple UpdateProcessor to set that expiration date. So I've been thinking it would be nice if there was a simple way to configure Solr to do it all for you.
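
      For example, the status quo looks roughly like this with SolrJ (the field name expire_at_dt and the core URL below are just hypothetical placeholders, not anything this issue defines):

        import org.apache.solr.client.solrj.SolrQuery;
        import org.apache.solr.client.solrj.SolrServer;
        import org.apache.solr.client.solrj.impl.HttpSolrServer;

        public class ManualExpiration {
          public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            // query time: hide documents whose expiration date has already passed
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("-expire_at_dt:[* TO NOW]");
            solr.query(q);

            // ...but something external still has to run this periodically to
            // actually purge the expired documents from the index
            solr.deleteByQuery("expire_at_dt:[* TO NOW]");
            solr.commit();
          }
        }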

      Attachments

      1. SOLR-5795.patch
        49 kB
        Hoss Man
      2. SOLR-5795.patch
        47 kB
        Hoss Man
      3. SOLR-5795.patch
        44 kB
        Hoss Man
      4. SOLR-5795.patch
        34 kB
        Hoss Man
      5. SOLR-5795.patch
        24 kB
        Hoss Man
      6. SOLR-5795.patch
        11 kB
        Hoss Man

          Activity

          Hoss Man added a comment -

          Here's the basic design I've been fleshing out in my head...

          • A new "ExpireDocsUpdateProcessorFactory"
            • can compute an expiration field to add to indexed docs based on a "ttl" field in the input doc
              • perhaps it could also fall back to a ttl update request param when bulk adding, similar to _version_?
              • IgnoreFieldUpdateProcessorFactory could be used to remove the ttl if they don't want a record in the index of when/why expiration_date was computed
            • Can trigger periodic deleteByQuery on expiration time field
          • rough idea for configuration...
            <processor class="solr.ExpireDocsUpdateProcessorFactory">
              <!-- mandatory, must be a date-based field in schema.xml -->
              <str name="expiration.fieldName">expire_at</str>
              <!-- optional, default is not to auto-expire docs -->
              <int name="deleteIntervalInSeconds">300</int>
              <!-- optional, default is not to compute expiration automatically.
                   If this field doesn't exist in the schema, then IgnoreFieldUpdateProcessorFactory can be configured to remove it.
                -->
              <str name="ttl.fieldName">ttl</str>
            </processor>
            
          • ExpireDocsUpdateProcessorFactory.init() logic:
            • if ttl.fieldName is specified make a note of it
            • validate expiration.fieldName is set & exists in schema
              • perhaps in managed schema mode create automatically if it doesn't?
            • if deleteIntervalInSeconds is set:
              • spin up a ScheduledThreadPoolExecutor with a recurring AutoExpireDocsCallable (see the sketch after this list)
              • add a core shutdown hook to shut down the executor when the core shuts down
          • ExpireDocsUpdateProcessor.processAdd() logic:
            • if ttl.fieldName is configured & doc contains that field name:
              • treat value as datemath from NOW and put computed value in expiration.fieldName
            • else: No-Op
          • AutoExpireDocsCallable logic:
            • if cloud mode, return No-Op unless we are running on the overseer
            • Create a DeleteUpdateCommand using deleteByQuery of [* TO NOW] using the expiration.fieldName
              • this can be fired directly against the UpdateRequestProcessor returned by the ExpireDocsUpdateProcessorFactory itself using a LocalSolrQueryRequest
                • Or perhaps we make an optional configuration so you can specify any chain name and we fetch it from the SolrCore?
              • the existing distributed delete logic should ensure it gets distributed cleanly in cloud mode
              • NOTE: the executor should run on every node, and only do the overseer check when it fires, so even when the overseer changes over time, whichever node is the overseer when the interval elapses will fire the delete.
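
          A very rough sketch of just the scheduling piece (plain JDK, not actual patch code; triggerExpirationDelete() is a hypothetical placeholder for firing the deleteByQuery of expire_at:[* TO NOW] through an update processor chain):

            import java.util.concurrent.Executors;
            import java.util.concurrent.ScheduledExecutorService;
            import java.util.concurrent.TimeUnit;

            public class AutoExpireScheduler {
              private final ScheduledExecutorService executor =
                  Executors.newScheduledThreadPool(1);

              public void start(int deleteIntervalInSeconds) {
                // scheduleAtFixedRate takes a Runnable, so the recurring task is
                // wrapped here instead of being scheduled directly as a Callable
                executor.scheduleAtFixedRate(new Runnable() {
                  public void run() {
                    triggerExpirationDelete(); // placeholder: deleteByQuery("expire_at:[* TO NOW]")
                  }
                }, deleteIntervalInSeconds, deleteIntervalInSeconds, TimeUnit.SECONDS);
              }

              public void shutdown() {
                // called from a core shutdown hook, per the design above
                executor.shutdownNow();
              }

              void triggerExpirationDelete() { /* placeholder */ }
            }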

          This, combined with things like DefaultValueUpdateProcessorFactory, IgnoreFieldUpdateProcessorFactory and FirstFieldValueUpdateProcessorFactory on the ttl.fieldName and/or expiration.fieldName, should allow all sorts of use cases:

          • every doc expires after X amount of time no matter what the client says
          • every doc defaults to a ttl of X unless it has an explicit per-doc ttl
          • every doc defaults to a ttl of X unless it has an explicit per-doc expire date
          • docs can optionally expire after a ttl specified when they were indexed
          • docs can optionally expire at an explicit time specified when they were indexed

          Hoss Man added a comment -

          The one hitch with this idea – which is already a problem if you do the same logic from an external client – is that as things stand today, if you do a lot of periodic deleteByQuery commands with auto-commit, every one will cause a new searcher to be opened, even if nothing was actually deleted – but it looks like we can fix that independently in SOLR-5783.

          I'm going to tackle the design I laid out here once I get SOLR-5783 in shape with enough tests that I'm comfortable committing.

          Steven Bower added a comment -

          The idea makes sense, but this seems to me like a horrible feature to add... Content should never be removed without explicit external interactions, and this will lead to so many "where did my content go" type problems. Especially since, once it's gone from the index, debugging what went wrong is not going to be easy. Writing a script to send a delete query periodically is really not that complex, and then it becomes the responsibility of the content owner/developer to delete content.

          I would suggest that if this does go in, some sort of "audit" output be produced (e.g. X docs deleted automatically, or a list of ids).

          Also, per this design, both the expiration and ttl fields must be required if specified in the config, else mayhem.

          Jan Høydahl added a comment -

          Duplicate of SOLR-3874

          Apart from that, I think this makes sense to have in a URP (UpdateRequestProcessor), like Chris Hostetter (Unused) suggests.

          Grant Ingersoll added a comment -

          Steven Bower Time To Live is a pretty common feature of data platforms. The "explicit external interaction" that you mentioned, in my mind, is the user/application setting up a TTL for a document; it just happens that the event lives in the future. It is also a quite common use case in compliance situations and in applications searching "low value" data, where you want to clean up old data periodically.

          +1, however, on the audit option.

          Hoss Man added a comment -

          Content should never be removed without explicit external interactions, and this will lead to so many "where did my content go" type problems. Especially since, once it's gone from the index, debugging what went wrong is not going to be easy. Writing a script to send a delete query periodically is really not that complex, and then it becomes the responsibility of the content owner/developer to delete content.

          I'm not sure I follow your reasoning there – "where did my content go" type situations can already exist via any deleteByQuery (not to mention really subtle things like SignatureUpdateProcessorFactory). If anything, the approach I'm suggesting should be more obvious than an external script – because it would need to be configured right there in the solrconfig.xml, where it's obvious and easy to see, as opposed to "where did my content go? ... time to wade through days of logs looking for deleteByQuery requests that could be coming from anywhere, at any interval of time."

          The bottom line is that someone with the ability to edit solrconfig.xml already has the ability to trump & manipulate & block & mess with content sent from remote clients by content owners/developers – this would in fact be another way to do that, but I don't think that's a bad thing. It would just be a simpler, self-contained way for Solr admins to say "I want to have a way to automatically expire content that people put in my index".

          I would suggest that if this does go in, some sort of "audit" output be produced (e.g. X docs deleted automatically, or a list of ids).

          That would be really nice in general with any sort of deleteByQuery – but it's not currently possible to get that info back from the IndexWriter. The best we can do is explicitly log when/why we are triggering the automatic deleteByQuery commands.


          I'm attaching a patch with a really rough proof of concept for the design outlined above ... still a lot of nocommits & error checking & tests needed, but it gives you something to try out to see what I had in mind.

          With this patch applied, you can start up the example and load docs along the lines of this...

          java -Durl="http://localhost:8983/solr/collection1/update?update.chain=nocommit" -Ddata=args -jar post.jar '<add><doc><field name="id">EXP</field><field name="_expire_at_">NOW+8MINUTES</field></doc><doc><field name="id">SAFE</field></doc><doc><field name="id">TTL</field><field name="_ttl_">+3MINUTES</field></doc></add>'
          
          • Every 5 minutes, a thread will wake up and delete docs
          • EXP has an explicit value in the _expire_at_ field of 8 minutes after it was indexed – if you index the docs immediately after starting up Solr, it should be deleted ~10 minutes after startup.
          • TTL has an implicit _ttl_ value of 3 minutes after it was indexed, which the processor converts to an absolute value and puts in the _expire_at_ field – if you index the docs immediately after starting up Solr, it should be deleted ~5 minutes after startup.
          • SAFE will never be deleted, because nothing gives it a value in the _expire_at_ field.

          One note where we definitely have to deviate from what I described initially: having the scheduled task use the factory to access the chain to trigger the delete didn't pan out, because I wasn't thinking clearly enough about what that existing API looks like – the factory doesn't know what chain it's in, or what processor(s) should be "next"; that's input to the getInstance() method on the factory from the chain. So instead the configuration requires you to specify the name of a chain (which can be the same chain you are in), and that chain is used to execute the delete.

          (The trickiest part of all of this will be writing the tests.)

          Hoss Man added a comment -

          Some baseline tests (including a watcher for the periodic delete commands) and fleshing out some of the config validation.

          I tweaked the names of the config args to try and make it more in-your-face that you are enabling "automatic deleting". From the test configs...

            <updateRequestProcessorChain name="convert-ttl">
              <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
                <str name="ttlFieldName">_ttl_</str>
                <str name="expirationFieldName">_expire_at_</str>
              </processor>
              <processor class="solr.IgnoreFieldUpdateProcessorFactory">
                <str name="fieldName">_ttl_</str>
              </processor>
            </updateRequestProcessorChain>
          
            <updateRequestProcessorChain name="scheduled-delete" default="true">
              <!-- NOTE: this chain is default so we can see that
                   autoDeleteChainName defaults to the default chain for the SolrCore
              -->
              <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
                <!-- str name="autoDeleteChainName">scheduled-delete</str -->
                <int name="autoDeletePeriodSeconds">3</int>
                <str name="expirationFieldName">eXpF</str>
              </processor>
              <processor class="solr.RecordingUpdateProcessorFactory" />
            </updateRequestProcessorChain>
          
          Hoss Man added a comment -

          Update patch: Minor improvements to the code, but a whole new cloud based test has been added.

          The "run only on overseer" logic still is the biggest piece of functionality that still needs implemented, because I can't seem to find anyway for code to know if it's the overseer – i spun that off into blocker SOLR-5823 since it might be meaty in it's own right, and will start looking into that next before i wory too much about polishing what's here.

          Hoss Man added a comment -

          Updated patch:

          • javadocs
          • refactor some redundant code
          • add support for configuring a "ttlParamName" that can be used instead of (or as a default to) the "ttlFieldName" (see the sketch after this list)
          • add scaffolding for the "only run on overseer" logic (waiting for SOLR-5823)
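
          For illustration only, a hedged SolrJ sketch of how such a request-level ttl might be used; the "_ttl_" param name, the chain name, and the core URL here are just assumptions for the example, not necessarily what the patch does:

            import org.apache.solr.client.solrj.SolrServer;
            import org.apache.solr.client.solrj.impl.HttpSolrServer;
            import org.apache.solr.client.solrj.request.UpdateRequest;
            import org.apache.solr.common.SolrInputDocument;

            public class TtlParamExample {
              public static void main(String[] args) throws Exception {
                SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "some-doc");

                UpdateRequest req = new UpdateRequest();
                req.add(doc);
                // assumed: pick a chain that contains the expiration processor
                req.setParam("update.chain", "convert-ttl");
                // assumed default TTL for every doc in this request that has no
                // per-doc ttl field of its own
                req.setParam("_ttl_", "+5MINUTES");
                req.process(solr);
              }
            }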

          There's still some TODOs but nothing that I think should be a blocker, just room for improvement and/or additional configuration.


          Unfortunately, when I tried testing this in combination with SOLR-5823 (so only the overseer triggers the periodic deletes), the distrib test failed repeatedly – it timed out waiting for the doc to be deleted, and it never was. I spent a bit of time looking through the logs, and I can't make sense of it:

          • the overseer logic seemed to be working; periodic deletes were being logged from one node, but the other nodes just logged once that they weren't the overseer and weren't going to manage the deletes
          • the deleteByQuery commands seemed to be getting forwarded – I was seeing deleteByQuery requests with TOLEADER and FROMLEADER params getting logged
          • likewise, the commit commands also seemed to be getting forwarded

          ...and yet still, the query loop for the doc that should be expired continuously got numFound=1

          I'll dig in more tomorrow with fresh eyes.

          In the meantime: feedback on the patch – particularly the javadocs, even if folks don't want to wade into the code – would be appreciated.

          Noble Paul added a comment -

          IMO there should be a default field name for ttl, say _ttl, even if no field name is specified.

          Upayavira added a comment -

          This looks very useful. Would it be possible, however, to set an "active" field to false instead of deleting? Or to set an "expired" field to true.

          Noble Paul added a comment -

          Upayavira There already is a field in the doc which marks the expiry time, so the same field can be used, right? There could be an option to not delete the docs (don't run the scheduler), but at query time you should be able to retrieve the expired docs with an extra filter/flag.

          Upayavira added a comment -

          Noble Paul: Yes, if the UpdateProcessor is converting a TTL into an exact time, then it is possible to filter on that exact time, and thus retrieve (un)deleted docs, which is what I was trying to get at. So you are correct, this should be possible already.

          Hoss Man added a comment -

          the deleteByQuery commands seemed to be getting forwarded – i was seeing deleteByQuery that had TOLEADER and FROMLEADER params getting logged.

          I'm not sure what I was looking at before, but after digging into the code a lot more I realized that the only deletes I was seeing were happening on the control server – which, it turns out, was acting as the overseer (see SOLR-5919) ... none of the replicas of the test collection were acting as the overseer, so nothing was doing periodic deletes in the test collection.

          Basically, when I laid out my design for dealing with cloud, I was being silly-stupid...

          if cloud mode, return No-Op unless we are running on the overseer

          ...because there is no guarantee that the overseer node will be hosting a core for every collection – you might have 1000 nodes in your cluster, and "collection47" might only be using cores on 10 of those nodes – that's only a 1 in 100 chance that any of collection47's nodes will be the overseer.

          So I'm going to need to step back and rethink a way of ensuring that the distributed deletes happen, but don't happen on every node and flood the whole collection with N**2 delete requests. (Possibly by using a micro "LeaderElection" just for this purpose? Constrained to the existing shard leaders? Or a best-guess heuristic about the shard leaders? It's not the end of the world to have some redundant deletes, we just don't want it to be exponential.)

          Hoss Man added a comment -

          Ok - overseer is dead, long live the overseer!

          After investigating some different options for preventing too many nodes from triggering redundant deletes, what I came up with is this...

          In simple standalone installations this method always returns true, but in cloud mode it will be true if and only if we are currently the leader of the (active) slice with the first name (lexicographically).

          I outlined the reasoning why I think this is the most straightforward solution in the code...

              // This is a lot simpler than doing our own "leader" election across all replicas 
              // of all shards since:
              //   a) we already have a per shard leader
              //   b) shard names must be unique
              //   c) ClusterState is already being "watched" by ZkController, no additional zk hits
              //   d) there might be multiple instances of this factory (in multiple chains) per 
              //      collection, so picking an ephemeral node name for our election would be tricky
          

          Watching the logs when running the tests, things look pretty good and seem to be operating as designed. That said: I'd still like to try and come up with some additional black-box tests to verify that only one node is triggering these deletes ... I've got some rough ideas, but nothing concrete – I'll keep thinking about it.
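
          To make that concrete, here's a rough sketch of the check; aside from slice.getLeader(), the cloud accessors used here are my assumptions about the 4.x API, and this isn't the actual patch code:

            import java.util.TreeSet;

            import org.apache.solr.cloud.CloudDescriptor;
            import org.apache.solr.common.cloud.ClusterState;
            import org.apache.solr.common.cloud.Replica;
            import org.apache.solr.common.cloud.Slice;

            public class PeriodicDeleteCheck {
              // true if this core should fire the periodic delete on this cycle
              static boolean iAmInCharge(CloudDescriptor cloud, ClusterState clusterState) {
                if (cloud == null) return true; // standalone Solr: always run

                String collection = cloud.getCollectionName();
                // lexicographically first slice name for this collection
                TreeSet<String> names =
                    new TreeSet<String>(clusterState.getSlicesMap(collection).keySet());
                if (names.isEmpty()) return false;
                String firstSlice = names.first();

                // only the leader of that one slice triggers the deletes
                if (!firstSlice.equals(cloud.getShardId())) return false;
                Slice slice = clusterState.getSlice(collection, firstSlice);
                Replica leader = (slice == null) ? null : slice.getLeader();
                return leader != null && leader.getName().equals(cloud.getCoreNodeName());
              }
            }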

          Anybody see any problems with this approach?


          IMO there should be a default field name for ttl, say _ttl, even if no field name is specified.

          I'd deliberately avoided doing that because I'm not a fan of "magic" field names, and I wanted to ensure we supported the ability to use this processor without any sort of TTL calculation – for people who just want to specify their own expiration field values explicitly.

          That said: having a sensible default probably would make the common case more useful – and we could always document (and test) using <null name="ttlFieldName"/> for people who want to disable it.

          I'll look into adding that tomorrow.

          Hoss Man added a comment -

          That said: having a sensible default probably would make the common case more useful – and we could always document (and test) using <null name="ttlFieldName"/> for people who want to disable it.

          Updated patch adds support for _ttl_ as a default for both ttlFieldName and ttlParamName.

          ASF subversion and git services added a comment -

          Commit 1583734 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1583734 ]

          SOLR-5795: New DocExpirationUpdateProcessorFactory supports computing an expiration date for documents from the TTL expression, as well as automatically deleting expired documents on a periodic basis

          ASF subversion and git services added a comment -

          Commit 1583741 from hossman@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1583741 ]

          SOLR-5795: New DocExpirationUpdateProcessorFactory supports computing an expiration date for documents from the TTL expression, as well as automatically deleting expired documents on a periodic basis (merge r1583734)

          ASF subversion and git services added a comment -

          Commit 1584097 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1584097 ]

          SOLR-5795: harden leader check to log cleanly (no NPE) in transient situations when there is no leader due to election in progress

          ASF subversion and git services added a comment -

          Commit 1584099 from hossman@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1584099 ]

          SOLR-5795: harden leader check to log cleanly (no NPE) in transient situations when there is no leader due to election in progress (merge r1584097)

          Shalin Shekhar Mangar added a comment -

          Hoss, instead of slice.getLeader(), you should use ZkStateReader.getLeaderRetry method.

          Hoss Man added a comment -

          Hoss, instead of slice.getLeader(), you should use ZkStateReader.getLeaderRetry method.

          That was actually a deliberate choice:

          These deletes are low priority and will recur frequently - so it's fine to abort quickly as a No-Op; no need to block waiting for a leader. These leader checks will also happen very often, on every node - so we don't want to be hammering ZK with active leader checks/retries in a potential high load / leader election / outage situation. The cached ClusterState info is "good enough" – even if it's stale, the worst case scenario is that multiple nodes trigger a handful of redundant deletes, or the deletes are skipped for one cycle – but the next one will take care of it.

          Shalin Shekhar Mangar added a comment -

          That makes sense. Thanks for explaining.

          Uwe Schindler added a comment -

          Close issue after release of 4.8.0


             People

             • Assignee: Hoss Man
             • Reporter: Hoss Man
             • Votes: 0
             • Watchers: 17
