Affects Version/s: 0.7, 0.8
Fix Version/s: 0.8
Under heavy load, hits returned by DistributedSearch.Client can become out of sync with the Client's live server list.
DistributedSearch.Client maintains an array of live search servers (liveAddresses). This array is updated at intervals by a watchdog thread. When the Client returns hits from a search, it tracks which hits came from which server by saving an index into the liveAddresses array (as Hit.indexNo).
The problem occurs when the search servers cannot service some remote procedure calls before the client times out (due to heavy load, for example), so that the set of servers considered live changes from one watchdog update to the next. If the Client returns some Hits from a search, and the liveAddresses array then changes while those Hits are still in use, their indexNos can become invalid: they may refer to a different server than the one the Hit originated from, or to no server at all.
Symptoms of this problem include:
- ArrayIndexOutOfBoundsException: when the liveAddresses array shrinks, a Hit that came from the last server in the previous update cycle has an indexNo past the end of the array.
- IOException: read past EOF: suppose a Hit comes back from server A with a doc number of 1000. The watchdog thread then updates liveAddresses, and the Hit now appears to have come from server B, which has only 900 documents. Fetching details for the Hit reads past EOF in server B's index.
- A "silent" failure: a hit is found on server A, but its details/summary are fetched from server B. To the user, it simply looks like an incorrect or nonsense hit.
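The failure modes above can be reproduced with a small model. This is only an illustrative sketch, not the actual Client code: the class StaleIndexDemo, the methods serverFor() and watchdogUpdate(), and the server names are hypothetical; only liveAddresses and indexNo come from the description above.

```java
// Hypothetical, simplified model of the bug; not the actual Nutch code.
final class StaleIndexDemo {
    // Stand-in for the Client's liveAddresses array, replaced by the watchdog.
    static String[] liveAddresses = {"serverA", "serverB", "serverC"};

    // Stand-in for a Hit: it remembers only a position in liveAddresses.
    static final class Hit {
        final int indexNo;
        final int docNo;

        Hit(int indexNo, int docNo) {
            this.indexNo = indexNo;
            this.docNo = docNo;
        }
    }

    // Resolves a Hit back to a server by position, as the buggy scheme does.
    static String serverFor(Hit hit) {
        return liveAddresses[hit.indexNo];
    }

    // Simulates a watchdog pass in which only the given servers responded.
    static void watchdogUpdate(String... responding) {
        liveAddresses = responding.clone();
    }
}
```

In this model, if serverA misses one watchdog pass, a Hit created with indexNo 0 silently resolves to serverB, and a Hit created with indexNo 2 throws ArrayIndexOutOfBoundsException.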
We have solved this locally by removing the liveAddresses array. Instead, the watchdog thread updates an array of booleans (the same size as the defaultAddresses array), each indicating whether the corresponding address responded to the latest call from the watchdog thread. Hit.indexNo is then always an index into the complete defaultAddresses array, so it is stable and always valid. Callers of getDetails()/getSummary()/etc. must still be aware that these methods may return null when the corresponding server is unable to respond.
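A minimal sketch of this approach follows. It is not the actual patch: the class name StableIndexClient, the field name live, and the methods watchdogUpdate() and serverFor() are assumptions made for illustration; only defaultAddresses and indexNo are taken from the description above.

```java
// Hypothetical sketch of the fix; names other than defaultAddresses and
// indexNo are assumptions, not the actual patch.
final class StableIndexClient {
    // Complete server list; never changes after construction.
    private final String[] defaultAddresses;

    // live[i] is true if defaultAddresses[i] answered the latest watchdog call.
    // The array is replaced as a whole so readers see a consistent snapshot.
    private volatile boolean[] live;

    StableIndexClient(String... defaultAddresses) {
        this.defaultAddresses = defaultAddresses.clone();
        // All entries start as false until the first watchdog pass.
        this.live = new boolean[defaultAddresses.length];
    }

    // Called by the watchdog thread with one flag per default address.
    void watchdogUpdate(boolean[] responded) {
        if (responded.length != defaultAddresses.length) {
            throw new IllegalArgumentException("one flag per default address required");
        }
        live = responded.clone();
    }

    // indexNo always indexes defaultAddresses, so it never goes stale.
    // Returns null when that server did not respond to the last watchdog call.
    String serverFor(int indexNo) {
        return live[indexNo] ? defaultAddresses[indexNo] : null;
    }
}
```

Because indexNo refers to a fixed array, a server that drops out and later returns still maps to the same index. The cost is that callers must handle a null result for servers currently marked as not responding.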