OK, I've been going around in circles on this issue for the past few weeks, but I've finally got something that feels like progress.
Listing everything new since the last patch would be fairly tedious, so I'll focus on the broad strokes...
punting on deletes (for now)
the previous processDelete method was totally broken in distributed queries (it was doing equality comparisons of Strings directly against enum objects, making assumptions about isLeader based on previous "add" commands in the same request, etc...)
I was struggling enough with getting adds to work properly that I just removed the delete code completely
based on the rest of the progress (see below) re-adding support for deletes should be straightforward, but it will require some refactoring of how the errors are tracked & counted to distinguish between an "add doc w/id=22" that fails and a "delete doc w/id=22" that fails in the same request
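To make that refactoring concrete, here's a rough sketch of tracking failures keyed on both the command type and the doc id. All names here are hypothetical, not code from the patch:

```java
public class ErrKeySketch {
  // Hypothetical enum -- the real patch would presumably reuse whatever
  // command-type info the update processor chain already has.
  enum CmdType { ADD, DELETE }

  // One tracked failure: the kind of command plus the id it applied to,
  // so "add id=22 failed" and "delete id=22 failed" stay distinguishable.
  static class KnownErr {
    final CmdType type;
    final String id;
    final String message;
    KnownErr(CmdType type, String id, String message) {
      this.type = type; this.id = id; this.message = message;
    }
    String key() { return type + ":id=" + id; }
  }

  public static void main(String[] args) {
    KnownErr add = new KnownErr(CmdType.ADD, "22", "missing required field");
    KnownErr del = new KnownErr(CmdType.DELETE, "22", "version conflict");
    System.out.println(add.key() + " -> " + add.message);  // ADD:id=22 -> missing required field
    System.out.println(del.key() + " -> " + del.message);  // DELETE:id=22 -> version conflict
  }
}
```

The same composite key could also be used in the responseHeader and in the exception metadata discussed below, so both failures for id=22 can be reported side by side.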
Dealing with forward(ing/ed) requests and async distributed failures
Most of the meat of the test failures in the last patch came from dealing with the async requests fired off by DistributedUpdateProcessor and how to deal with failures reported by other leaders.
I started down the road of trying to do a better job in SolrCmdDistributor of tracking the UpdateCommand that corresponds with each Req so that when processing the Error we could know what failed remotely – this code is still in the patch (because I think it's cleaner than the cmdString tracking currently in Solr) but proved mostly useless because of how ConcurrentUpdateSolrClient can combine multiple "requests" together.
The next step, which mostly worked, was to improve the error handling in DUP's finish() so that instead of aborting with info about whatever (remote) Error happened to return first, it now returns a new DistributedUpdatesAsyncException which wraps & remembers the summary info of all the remote errors encountered – or at least, all of the errors that were previously candidates to tell the user about. Stuff that was swallowed & logged before is still swallowed & logged.
One notable change here is that I switched DUP.finish() from directly calling SolrQueryResponse.setException() and instead made it throw the exception. Independent of this issue, the existing behavior seems like a bug / bad form – what if the caller already caught some earlier exception it wants to return, and finish() is just being called in a finally block?
This led me to discover SOLR-8633 – the patch from that issue is currently included in this patch because it's necessary for the current code.
FWIW: if folks think calling SolrQueryResponse.setException() is better for some wacko reason, then TolerantUpdateProcessor.finish() could, in theory, go check SolrQueryResponse.getException() and treat it the same way as exceptions it catches (see below), but that's a lot more tedious (and error prone in the long run)
Once DUP.finish() started throwing DistributedUpdatesAsyncException, tracking all the various errors we care about from distributed requests became possible...
TolerantUpdateProcessor now pays attention to when the request is a TOLEADER forwarded request, and if so:
it acts tolerant of up to maxErrors failures (just like single node)
in finish(), if any failures happened, return the first error – and annotate it with info about all the failures in this request, using the existing SolrException.getMetadata() map.
ConcurrentUpdateSolrClient already ensures that if a SolrException happens on the remote server, any "metadata" included in the response for that exception is copied into the local SolrException it constructs
So if/when TolerantUpdateProcessor.finish() catches a DistributedUpdatesAsyncException, it loops over the wrapped exceptions, and pulls out the metadata for each of those to update its list of known failures
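The metadata round-trip described above looks roughly like this. The classes below are deliberately simplified stand-ins for SolrException's metadata map and the wrapping exception, not the real Solr classes, and the "failed-id" key is invented for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a remote SolrException carrying a metadata map.
class MetaException extends RuntimeException {
  final Map<String, String> metadata = new LinkedHashMap<>();
  MetaException(String msg) { super(msg); }
}

// Stand-in for the exception that wraps all the remote errors.
class WrappedErrors extends RuntimeException {
  final List<MetaException> remote = new ArrayList<>();
}

public class MetadataMerge {
  // What a tolerant finish() would do when it catches the wrapper:
  // loop over the remote errors and pull each one's metadata into
  // the local list of known failures.
  static List<String> mergeFailedIds(WrappedErrors wrapped) {
    List<String> failedIds = new ArrayList<>();
    for (MetaException me : wrapped.remote) {
      String id = me.metadata.get("failed-id");  // hypothetical key
      if (id != null) failedIds.add(id);
    }
    return failedIds;
  }

  public static void main(String[] args) {
    MetaException e1 = new MetaException("bad doc");
    e1.metadata.put("failed-id", "22");
    MetaException e2 = new MetaException("bad doc");
    e2.metadata.put("failed-id", "45");

    WrappedErrors wrapped = new WrappedErrors();
    wrapped.remote.add(e1);
    wrapped.remote.add(e2);
    System.out.println(mergeFailedIds(wrapped));  // [22, 45]
  }
}
```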
which error is returned if maxErrors exceeded: now the "first" one
I mentioned this above in discussing how async errors from other leaders are handled, but I wanted to emphasize it and elaborate a bit
in the original patch, exceptions were completely ignored until maxErrors had been seen, at which point the next exception was immediately (re)thrown
that made sense for single node cases, but in a distributed case, with async exceptions, we can't always just "rethrow" an exception
it's also ambiguous what the "last" exception really is in an async situation.
so now, instead, TolerantUpdateProcessor always remembers the first exception it caught, and if/when maxErrors is exceeded, it re-throws that first exception
later, in finish() that (remembered) exception is annotated with metadata about all the failures
this metadata is critical for forwarded requests, but should also be useful for SolrClient users who (should) see it copied into a RemoteSolrException even if they don't get the normal UpdateResponse object and its getResponseHeader()
this, in my opinion, makes the behavior of using TolerantUpdateProcessor more useful (and consistent with how stock Solr works) when there is a fundamental problem with your data – you get an error indicating the first problem document in your data, as opposed to seeing an error from the 11th, or 101st, or (maxErrors+1)th malformed document in your data.
if folks disagree, we just need to re-work the FirstErrTracker class to be a LastErrTracker class...
FirstErrTracker.throwFirst() becomes LastErrTracker.throwLast()
instead of checking null == first, LastErrTracker.caught(Throwable) would ignore any additional exceptions once true == thrown
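For reference, the core of the FirstErrTracker logic described above is tiny – this is just a sketch of the idea, not the class from the patch:

```java
// Minimal sketch: remember the first exception caught, and rethrow
// that same (first) exception if/when maxErrors is exceeded.
public class FirstErrTracker {
  private Throwable first = null;

  // Record an exception; only the first one caught is kept.
  public void caught(Throwable t) {
    if (first == null) first = t;
  }

  // If anything was caught, rethrow the remembered (first) exception.
  public void throwFirst() throws Throwable {
    if (first != null) throw first;
  }

  public Throwable getFirst() { return first; }

  public static void main(String[] args) {
    FirstErrTracker tracker = new FirstErrTracker();
    tracker.caught(new RuntimeException("doc 1 malformed"));
    tracker.caught(new RuntimeException("doc 2 malformed"));
    // the first exception wins
    System.out.println(tracker.getFirst().getMessage());  // doc 1 malformed
  }
}
```

The LastErrTracker alternative would flip the `null == first` check as described: keep overwriting the remembered exception on each caught(), and stop accepting new ones once throwLast() has actually thrown.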
the problem with returning numAdds (and numDeletes if we want that)
getting numAdds to work correctly in a distributed request is really hard
currently the code (specifically on the first node processing the AddUpdateCommand) just requires that super.processAdd() succeeds to do numAdds++
this gives a misleading number when the failures aren't reported immediately because they were forwarded async to a diff leader and we only find out about problems in finish()
even if we only did numAdds++ for docs where we know we are the leader, getting the count from other leaders later is tricky
right now, the only way TolerantUpdateProcessor learns the results of any async requests to other leaders is if DUP.finish() throws an error – we can get numAdds from the metadata of those errors, but that doesn't help the case when requests succeed
DUP doesn't currently do anything with successful responses, so there's no easy way to get the numAdds in that case
eliminate the concept of numAdds from this feature, focus solely on being tolerant and reporting the errors (which we can do accurately)
if people want to know how many succeeded, they can use features like versions=true explicitly – although even then, I'm not sure if it will work reliably today ... I don't see any indication that DUP merges the remote responses when that feature is used.
make TolerantUpdateProcessor detect when it's not a leader for something, and in that case implicitly add versions=true to get the info from the requests we forward, and prune the response down to just a simple numAdds count in finish()
same problem as above if I'm correct about versions=true not currently being handled correctly in forward(ing/ed) requests
return distinct numAddsAttempted and numAddsConfirmed values – probably not as useful to end clients, but more accurate representation of what we know...
track a distinct "numAddsAttempted" for every shard we forward to (added together at end of request)
kind of a sketchy pain in the ass to do
assume numAddsConfirmed = numAddsAttempted for any other shard leader we don't see an exception from
for shard leaders we do get exceptions from, add to our numAddsConfirmed based on the metadata (ie: numAddsConfirmed += remoteException.getMetadata(NUM_ADDS_CONFIRMED))
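A quick sketch of what that attempted-vs-confirmed arithmetic would look like. The shard names, counts, and method are all made up for illustration; only the bookkeeping pattern matters:

```java
import java.util.Map;

public class AddCountSketch {
  // Sum up per-shard counts: attempted is what we forwarded; confirmed
  // trusts the attempted number unless an exception from that shard
  // told us (via metadata) how many actually succeeded.
  static int[] tally(Map<String, Integer> attemptedPerShard,
                     Map<String, Integer> confirmedFromErrors) {
    int attempted = 0, confirmed = 0;
    for (Map.Entry<String, Integer> e : attemptedPerShard.entrySet()) {
      attempted += e.getValue();
      // no exception from a shard => assume confirmed == attempted
      confirmed += confirmedFromErrors.getOrDefault(e.getKey(), e.getValue());
    }
    return new int[] { attempted, confirmed };
  }

  public static void main(String[] args) {
    // forwarded 10 docs to shard1, 7 to shard2; shard2's exception
    // metadata said only 4 were confirmed before it failed
    int[] counts = tally(Map.of("shard1", 10, "shard2", 7),
                         Map.of("shard2", 4));
    System.out.println(counts[0] + " attempted, " + counts[1] + " confirmed");
    // -> 17 attempted, 14 confirmed
  }
}
```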
treat it as a distinct feature in a new jira
getting a numAdds (or numDeletes) count in the responseHeader really feels like it should be an orthogonal feature to being tolerant of maxErrors
we should open a distinct issue to track adding that as a feature, and ensuring that DUP aggregates correctly from all the various leaders that requests get forwarded to
this is what I personally think we should do – notably because it would make it trivial to know how many documents you added when streaming a bunch of data, even if you don't use this update processor.
the nuance of the maxErrors=N param
I just want to point out – in a distributed cloud setup, there is no way to truly enforce a hard limit at maxErrors
the async processing means that if a request includes docs destined for diff shards, we can't ensure that at most maxErrors errors will be hit – we might hit more errors in async threads before we notice and stop processing
I don't think this is a problem – it's a feature of handling updates to diff shards/leaders in parallel – but it does mean we might want to rethink the param name (errorToleranceThreshold=N perhaps?)
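Here's a toy demonstration of why the limit is soft (plain executor, no Solr code): once several async requests are in flight, they can all fail before anything has a chance to check the counter and stop sending more.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class MaxErrorsRace {
  static final int MAX_ERRORS = 3;

  // Simulate `inflight` async requests that have already been sent and
  // all come back as failures; returns how many errors we end up seeing.
  static int runFailures(int inflight) {
    AtomicInteger errorCount = new AtomicInteger();
    CountDownLatch allFailed = new CountDownLatch(inflight);
    ExecutorService pool = Executors.newFixedThreadPool(inflight);
    for (int i = 0; i < inflight; i++) {
      pool.execute(() -> {
        errorCount.incrementAndGet();  // the failure already happened
        allFailed.countDown();
      });
    }
    try {
      allFailed.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    pool.shutdown();
    return errorCount.get();
  }

  public static void main(String[] args) {
    // 8 requests were dispatched before any error was noticed, so we
    // blow well past the nominal threshold of 3.
    System.out.println("maxErrors=" + MAX_ERRORS
        + " but saw " + runFailures(8) + " errors");
  }
}
```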
There's still a lot of work to do, in no particular order...
need to refactor the way we track errors (both locally and stashed in the SolrException metadata) so that we can distinguish "add doc w/id=22" failures from "delete doc w/id=22" failures
probably need to rethink the responseHeader formatting for how errors are tracked as well in order to distinguish them
once we have that, adding a processDeletes(...) method that works similar to processAdd should be trivial
need to beef up the cloud testing to include delete checks
all of the tests using CloudSolrClient currently fail because of how that client does its own "direct to leaders" splitting/merging of docs destined for diff shards.
a bunch of new code needs to be written there (we may be able to refactor/share some stuff from the error list merging code in TolerantUpdateProcessor – see above about refactoring that for deletes)
need lots more randomized testing
tons of nocommits that need to be cleaned up