Maybe we should call this near-real-time get?
That sort of defeats the purpose of the issue - it's supposed to be a 100% reliable get of the latest version of a document.
Right, it will always return the last added doc under that ID; I'm not
disputing that part.
I am disputing that it's really "real-time" given that it's built on
top of "near-real-time". I.e., calling this real-time is over-selling
it, I think; won't the performance be poor?
Another thing to consider is NRTCachingDir; it's good for reducing
latency when you are frequently flushing tiny segments (it makes the
reopen IO-less, except for the ID lookups, unless you use MemCodec, at
which point the NRT open is fully IO-free).
The approach here is to always reopen the reader on demand when an RT get arrives, i.e., whenever changes have been made to the index with IndexWriter since the last reopen?
I was thinking ahead to a more generic version where one could specify the clock (I think this will be needed for future distributed indexing support). I actually first added a version that took an explicit clock, but then simplified it to always use the latest clock and marked it as experimental.
What kind of "clocks" would one want to plug in here? Do you mean you
could choose to accept some staleness if you wanted (plug in a clock
that only increments periodically if there had been updates)?
But, stepping back, this approach (open a new NRT reader on demand) seems dangerous? I.e., perf will be poor if a client has one thread constantly updating and another constantly doing RT gets?
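To make the concern concrete, here is a rough toy model of the reopen-on-demand idea. It is only a sketch: `ToyIndex`, `realtime_get`, and `interleaved_reopens` are hypothetical names invented for illustration, not Lucene or Solr APIs, and a dict merge stands in for the real cost of opening an NRT reader.

```python
class ToyIndex:
    """Hypothetical stand-in for a writer plus an NRT reader (not a Lucene API)."""

    def __init__(self):
        self._pending = {}   # updates the current "reader" cannot see yet
        self._visible = {}   # snapshot the current "reader" sees
        self.reopens = 0     # how many times we had to reopen

    def update(self, doc_id, doc):
        self._pending[doc_id] = doc

    def realtime_get(self, doc_id):
        # Reopen on demand: only when something changed since the last reopen.
        if self._pending:
            self._visible.update(self._pending)
            self._pending.clear()
            self.reopens += 1
        return self._visible.get(doc_id)


def interleaved_reopens(n):
    """Alternate one update with one RT get, n times; return the reopen count."""
    idx = ToyIndex()
    for i in range(n):
        idx.update("doc", {"version": i})
        assert idx.realtime_get("doc") == {"version": i}
    return idx.reopens
```

In this model every get that follows an update pays for a reopen, so a strictly alternating update/get workload reopens once per get, which is the worst case described above.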
It's better than what we have today, and it can be optimized in the future.
I agree, progress not perfection.
One way would be with a bloom filter of updates that are not yet visible. Another way will again relate to recovery in distributed indexing, when we'll need to ask another node what all the latest updates after clock x were (and since we'll have those on hand, we can check any realtime-get against that first).
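A minimal sketch of the bloom filter idea, under the assumption that the filter tracks the IDs updated since the last reopen and is cleared on each reopen. All names here (`BloomFilter`, `FilteredIndex`, `realtime_get`) are hypothetical and nothing below is an existing Lucene or Solr class; the dicts are a toy stand-in for the writer and reader.

```python
import hashlib


class BloomFilter:
    """Small illustrative Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit set

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def clear(self):
        self.bits = 0


class FilteredIndex:
    """Toy index that reopens only if the requested ID may have unseen updates."""

    def __init__(self):
        self._pending = {}            # updates not yet visible
        self._visible = {}            # what the current "reader" sees
        self._unseen = BloomFilter()  # IDs updated since the last reopen
        self.reopens = 0

    def update(self, doc_id, doc):
        self._pending[doc_id] = doc
        self._unseen.add(doc_id)

    def realtime_get(self, doc_id):
        if self._unseen.might_contain(doc_id):
            # Possibly stale for this ID: reopen, then reset the filter.
            self._visible.update(self._pending)
            self._pending.clear()
            self._unseen.clear()
            self.reopens += 1
        return self._visible.get(doc_id)
```

A get for an ID that was not updated since the last reopen usually skips the reopen entirely; a false positive only costs an unnecessary reopen, never a stale result.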
Maybe Solr should use a transaction log (like ElasticSearch)? I think
(not certain) that ES serves a RT get directly out of its transaction
log if the doc is in it (else falls back to the reader)? Then
simultaneous updates + gets should really be real-time. But I
realize that'd be a much bigger change...
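For illustration, the log-first lookup described above could look roughly like this. This is a sketch of the idea only, not how ES or Solr actually implements it; `TranslogIndex`, `refresh`, and `realtime_get` are hypothetical names, and a dict stands in for a real (durable, append-only) transaction log.

```python
class TranslogIndex:
    """Toy model: serve RT gets from the log first, else from the reader."""

    def __init__(self):
        self._translog = {}  # doc_id -> latest update not yet visible in the reader
        self._visible = {}   # what the current "reader" sees
        self.reopens = 0

    def update(self, doc_id, doc):
        self._translog[doc_id] = doc

    def refresh(self):
        # Periodic reopen, decoupled from gets; makes logged updates searchable.
        self._visible.update(self._translog)
        self._translog.clear()
        self.reopens += 1

    def realtime_get(self, doc_id):
        # The newest version, if any, is in the log; no reopen is needed.
        if doc_id in self._translog:
            return self._translog[doc_id]
        return self._visible.get(doc_id)
```

In this model a get never triggers a reopen, so simultaneous updates and gets stay cheap, and the reader can be refreshed on whatever schedule suits search.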