Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 3.4.0
- Reviewed
Description
Describe
I previously fixed the long-lasting NoNamenodesAvailableException that occurs during failover when observers are not used, but with observers enabled there are still several problems.
- When the observer fails and there is no active namenode at that moment, rotating the cache does not help: because observer reads are in use, the next request shuffles the observer namenode back to the front of the cache, so the retry is still sent to the failed observer.
- If there are multiple observers, and a call to one observer fails while there is no active namenode, a NoNamenodesAvailableException is thrown and the request is retried. However, since observer reads put the observer nodes at the front of the cache, the retry may still fail.
- When there are multiple observers, one of which is unavailable, and there is no active namenode, we should continue with the next observer. That way the failing observer can be marked as unavailable and subsequent requests can avoid it.
- If the request is an illegal operation, that is, one that fails even when sent to the active namenode, it can also result in a NoNamenodesAvailableException. If the cache is rotated in that case, the next legal request is sent to a namenode that really is in standby and fails. Illegal operations should therefore not rotate the cache.
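The last two points can be sketched as a small model. This is only an illustration under assumptions, not the Router's actual code or the patch: the class `ObserverFailoverSketch`, its methods, and the `callSucceeds` predicate are hypothetical names, and namenodes are represented by plain strings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical model of the behaviour requested above; not the Router's real classes. */
final class ObserverFailoverSketch {
  /** Observers that failed a call and should be skipped by later requests. */
  final Set<String> unavailable = new HashSet<>();

  /**
   * Tries each observer in order and returns the first one that answers,
   * or null if none does. Every observer that fails is recorded as unavailable.
   */
  String readFromObservers(List<String> observers, Predicate<String> callSucceeds) {
    for (String observer : observers) {
      if (unavailable.contains(observer)) {
        continue; // already known to be bad, do not send the request there again
      }
      if (callSucceeds.test(observer)) {
        return observer;
      }
      unavailable.add(observer); // remember the failure for subsequent requests
    }
    return null;
  }

  /**
   * Rotate the cache only when the failure says something about the namenode.
   * A request that would fail even on the active namenode must not rotate it.
   */
  static boolean shouldRotateCache(boolean failedBecauseOperationIsIllegal) {
    return !failedBecauseOperationIsIllegal;
  }
}
```

With observers `[observer-1, observer-2]` and a predicate that only accepts `observer-2`, `readFromObservers` returns `observer-2` and leaves `observer-1` in `unavailable`, so the next request skips it.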
Detailed bug description: HDFS-17166
- case 1:
- router's cache : [ observer-1(problematic), standby-2, standby-3(actually active) ]
- client read -> observer-1 throw NoNamenodesAvailableException -> rotate the cache -> [ standby-2, standby-3(actually active),observer-1(problematic) ]
- client retry read -> shuffleObserverNN -> [ observer-1(problematic), standby-2, standby-3(actually active) ] -> observer-1 throw NoNamenodesAvailableException -> rotate the cache -> [ standby-2, standby-3(actually active),observer-1(problematic) ]
- .....
- client (retries > max.attempts) -> Read failed
- case 2:
- router's cache : [ observer-1(problematic), observer-2, standby-3, standby-4(actually active) ]
- client read -> observer-1 throw NoNamenodesAvailableException -> rotate the cache -> [ observer-2, standby-3, standby-4(actually active),observer-1(problematic) ]
- client retry read -> shuffleObserverNN -> [ observer-1(problematic), observer-2, standby-3, standby-4(actually active) ] (may happen) -> observer-1 throw NoNamenodesAvailableException -> rotate the cache -> [ observer-2, standby-3, standby-4(actually active),observer-1(problematic) ]
- .....
- client (retries may exceed max.attempts) -> Read failed
- case 3:
- router's cache : [ standby-1, standby-2(actually active) ]
- client request -> standby-1 throw NoNamenodesAvailableException -> rotate the cache -> [ standby-2(actually active),standby-1 ]
- client retry request -> standby-2(actually active) success
- client illegal request -> standby-2(actually active) throw NoNamenodesAvailableException -> rotate the cache -> [ standby-1, standby-2(actually active) ]
- client legal request -> standby-1 throw NoNamenodesAvailableException -> failed
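The loop in case 1 can be replayed with a few lines of list manipulation. The sketch below is a simplified model that follows the traces as written: `rotate` and `observersFirst` are hypothetical stand-ins for "rotate the cache" and the shuffleObserverNN step, namenodes are plain strings, and the `unavailable` set models the idea of marking a failed observer so that it is no longer promoted. It is not the Router implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical replay of the case 1 trace; names are illustrative only. */
final class CacheRotationTrace {
  /** Moves the first cache entry to the end, as in "rotate the cache". */
  static void rotate(List<String> cache) {
    if (!cache.isEmpty()) {
      cache.add(cache.remove(0));
    }
  }

  /** Returns a new ordering with usable observers first, keeping relative order. */
  static List<String> observersFirst(List<String> cache, Set<String> unavailable) {
    List<String> ordered = new ArrayList<>();
    for (String nn : cache) {
      if (isUsableObserver(nn, unavailable)) {
        ordered.add(nn);
      }
    }
    for (String nn : cache) {
      if (!isUsableObserver(nn, unavailable)) {
        ordered.add(nn);
      }
    }
    return ordered;
  }

  private static boolean isUsableObserver(String nn, Set<String> unavailable) {
    return nn.startsWith("observer") && !unavailable.contains(nn);
  }

  /** One failed attempt: rotate the (non-empty) cache, then see who the retry hits first. */
  static String headAfterRetry(List<String> cache, Set<String> unavailable) {
    rotate(cache);
    return observersFirst(cache, unavailable).get(0);
  }

  public static void main(String[] args) {
    List<String> cache =
        new ArrayList<>(Arrays.asList("observer-1", "standby-2", "standby-3"));
    // Nothing marked unavailable: the rotation is undone by the observer shuffle.
    System.out.println(headAfterRetry(cache, Collections.emptySet())); // observer-1

    cache = new ArrayList<>(Arrays.asList("observer-1", "standby-2", "standby-3"));
    // observer-1 marked unavailable: the retry moves on to the next namenode.
    System.out.println(headAfterRetry(cache, Collections.singleton("observer-1"))); // standby-2
  }
}
```

In the first call the retry lands on observer-1 again no matter how often the cache is rotated, which is the loop shown in case 1. In the second call the failed observer is not promoted, so the rotation actually changes which namenode is tried first.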
How to reproduce
I have provided a unit test: TestNoNamenodesAvailableLongTime
Run the new unit test against the original code to reproduce the problems above.