Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Hadoop Flags: Reviewed
Description
Internally we found that reading from an ObserverNode may result in a BlockMissingException. This may happen when the observer sees fewer DNs than the active does (perhaps due to a communication issue with those DNs), or (we guess) when block reports from some DNs reach the observer late. The error occurs in DFSInputStream#chooseDataNode, when no valid DN can be found for the LocatedBlock returned by the NN.
One potential solution (although a little hacky) is to have DFSInputStream retry against the active when this happens. The retry logic is already present in the code; we just have to dynamically set a flag that tells ObserverReadProxyProvider to try the active in this case.
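As a rough illustration of the idea only, here is a minimal, self-contained sketch. It does not use the real HDFS classes: `LocationSource`, `FallbackProxy`, `MissingBlockException`, and the `useActive` flag are hypothetical stand-ins for the NameNode proxies, ObserverReadProxyProvider, BlockMissingException, and the proposed flag. The actual patch may look quite different.

```java
import java.io.IOException;
import java.util.List;

public class ObserverFallbackSketch {

    /** Hypothetical stand-in for the error raised when no valid DN is found. */
    static class MissingBlockException extends IOException {
        MissingBlockException(String message) {
            super(message);
        }
    }

    /** Hypothetical view of a NameNode: returns the DNs it knows for a block. */
    interface LocationSource {
        List<String> locate(String blockId);
    }

    /** Hypothetical proxy: reads from the observer unless told to use the active. */
    static class FallbackProxy implements LocationSource {
        private final LocationSource observer;
        private final LocationSource active;
        private boolean useActive = false; // the dynamically set flag

        FallbackProxy(LocationSource observer, LocationSource active) {
            this.observer = observer;
            this.active = active;
        }

        void setUseActive(boolean useActive) {
            this.useActive = useActive;
        }

        @Override
        public List<String> locate(String blockId) {
            return (useActive ? active : observer).locate(blockId);
        }
    }

    /**
     * Picks a DN for the block. If the observer's answer contains no DN,
     * sets the flag and asks again, which routes the retry to the active.
     */
    static String chooseDataNode(FallbackProxy proxy, String blockId)
            throws MissingBlockException {
        try {
            List<String> dataNodes = proxy.locate(blockId);
            if (dataNodes.isEmpty()) {
                proxy.setUseActive(true); // retry against the active
                dataNodes = proxy.locate(blockId);
            }
            if (dataNodes.isEmpty()) {
                throw new MissingBlockException("No valid DN for " + blockId);
            }
            return dataNodes.get(0);
        } finally {
            proxy.setUseActive(false); // later reads go back to the observer
        }
    }
}
```

In this sketch the flag is reset once the call completes, so only the failing read pays the cost of going to the active; whether the real change should reset the flag this way is one of the open design questions.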
cc shv, xkrogen, vagarychen, zero45 for discussion.