Content should never be removed without explicit external interaction; this will lead to so many "where did my content go" type problems. Especially since once it's gone from the index, debugging what went wrong is not going to be easy. Writing a script to send a delete-by-query periodically is really not that complex, and then it becomes the responsibility of the content owner/developer to delete content.
I'm not sure i follow your reasoning there – "where did my content go" type situations can already exist via any deleteByQuery (not to mention really subtle things like SignatureUpdateProcessorFactory). If anything, the approach i'm suggesting should be more obvious than an external script – because it would need to be configured right there in the solrconfig.xml, where it's obvious and easy to see, as opposed to "where did my content go? ... time to wade through days of logs looking for deleteByQuery requests that could be coming from anywhere, at any interval of time."
The bottom line is that someone with the ability to edit solrconfig.xml already has the ability to trump & manipulate & block & mess with content sent from remote clients by content owners/developers – this would in fact be another way to do that, but i don't think that's a bad thing. It would just be a simpler, self-contained way for solr admins to say "I want to have a way to automatically expire content that people put in my index".
I would suggest that if this does go in, some sort of "audit" output be produced (e.g. the number of docs deleted automatically, or a list of ids).
That would be really nice in general with any sort of deleteByQuery – but it's not currently possible to get that info back from the IndexWriter. The best we can do is explicitly log when/why we are triggering the automatic deleteByQuery commands.
I'm attaching a patch with a really rough proof of concept for the design outlined above ... still a lot of nocommits & error checking & tests needed, but it gives you something to try out to see what I had in mind.
With this patch applied, you can start up the example and load docs along the lines of this...
java -Durl="http://localhost:8983/solr/collection1/update?update.chain=nocommit" -Ddata=args -jar post.jar '<add><doc><field name="id">EXP</field><field name="_expire_at_">NOW+8MINUTES</field></doc><doc><field name="id">SAFE</field></doc><doc><field name="id">TTL</field><field name="_ttl_">+3MINUTES</field></doc></add>'
- Every 5 minutes, a thread will wake up and delete any docs whose _expire_at_ value has passed.
- EXP has an explicit value in the _expire_at_ field of 8 minutes after it was indexed – if you index the docs immediately after starting up Solr, it should be deleted ~10 minutes after startup.
- TTL has an implicit _ttl_ value of 3 minutes after it was indexed, which the processor converts to an absolute value and puts in the _expire_at_ field – if you index the docs immediately after starting up Solr, it should be deleted ~5 minutes after startup.
- SAFE will never be deleted, because nothing gives it a value in the _expire_at_ field.
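For what it's worth, the _ttl_ → _expire_at_ conversion the processor does is just date math: take the time the doc was indexed and add the relative offset. Here's a minimal sketch of that idea – note this is a simplified stand-in, not Solr's actual date-math code, and it only handles the "+<n>MINUTES" form used in the example above:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TtlSketch {
    // Only handles the "+<n>MINUTES" style shown above, not Solr's full date-math syntax.
    private static final Pattern TTL = Pattern.compile("\\+(\\d+)MINUTES");

    /** Convert a relative TTL like "+3MINUTES" into an absolute expiration instant. */
    static Instant expireAt(String ttl, Instant indexedAt) {
        Matcher m = TTL.matcher(ttl);
        if (!m.matches()) {
            throw new IllegalArgumentException("unsupported TTL: " + ttl);
        }
        return indexedAt.plus(Duration.ofMinutes(Long.parseLong(m.group(1))));
    }

    public static void main(String[] args) {
        Instant indexed = Instant.parse("2014-01-01T00:00:00Z");
        // doc indexed at midnight with _ttl_ of "+3MINUTES" expires at 00:03
        System.out.println(expireAt("+3MINUTES", indexed)); // 2014-01-01T00:03:00Z
    }
}
```

Once every doc has an absolute _expire_at_ value like that, the periodic deleteByQuery just needs a range query on that field.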
One note where we definitely have to deviate from what i described initially: having the scheduled task use the factory to access the chain to trigger the delete didn't pan out, because i wasn't thinking clearly enough about what that existing API looks like – the factory doesn't know what chain it's in, or what processor(s) should be "next"; that's input to the getInstance() method on the factory from the chain. So instead, the configuration requires you to specify the name of a chain (which can be the same chain you are in), and that chain is used to execute the delete.
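To make that chain-name requirement concrete, the configuration would look something along these lines – the param names here are illustrative guesses, not necessarily what the attached patch uses; only the chain name "nocommit" and the field names come from the example above:

```xml
<updateRequestProcessorChain name="nocommit">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <str name="ttlFieldName">_ttl_</str>
    <str name="expirationFieldName">_expire_at_</str>
    <!-- how often the background thread wakes up to delete expired docs -->
    <int name="autoDeletePeriodSeconds">300</int>
    <!-- name of the chain used to execute the periodic deleteByQuery;
         may be the same chain this factory is configured in -->
    <str name="deleteChainName">nocommit</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```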
(The trickiest part of all of this, will be writing the tests)