[HDFS-17232] RBF: Fix NoNamenodesAvailableException for a long time, when use observer - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0
Component/s: rbf
Labels:
- pull-request-available

Target Version/s: 3.4.0
Hadoop Flags: Reviewed

Description

Describe

I previously fixed the long-lasting NoNamenodesAvailableException that occurs during failover when observers are not in use, but several problems remain when observers are in use.

When an observer fails and there is no active namenode at that moment, rotating the cache does not help: because observer reads are in use, the next request shuffles the observer namenode back to the front of the cache, so the retry is sent to the failed observer again.
If there are multiple observers, and an exception occurs on one of them while there is no active namenode, a NoNamenodesAvailableException is thrown and the request is retried. However, because observer reads put the observer nodes at the front of the cache, the retry may still fail.
When there are multiple observers, one of them is unavailable, and there is no active namenode, we should continue with the next observer. That way the failed observer is marked as unavailable and subsequent requests avoid it.
If the request is an illegal operation, that is, one that fails even when sent to the active namenode, it also results in a NoNamenodesAvailableException. If the cache is rotated at this point, the next legal request is sent to a namenode that really is standby, and the legal request fails. Illegal operations should therefore not rotate the cache.
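The third point can be sketched as a small toy model. This is not the router's code: the class `ObserverFallbackSketch`, its `pickObserver` method, and the `broken` set that stands in for real RPC failures are all hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (not Hadoop code): keep trying observers and remember the failed ones. */
final class ObserverFallbackSketch {
  // Observers that have already failed once and should be skipped.
  private final Set<String> unavailable = new HashSet<>();
  // Assumption: stands in for "an RPC to this observer throws an exception".
  private final Set<String> broken;

  ObserverFallbackSketch(Set<String> broken) {
    this.broken = broken;
  }

  /** Returns the first observer that answers, or null if none does. */
  String pickObserver(List<String> observers) {
    for (String observer : observers) {
      if (unavailable.contains(observer)) {
        continue; // already known to be bad, do not try it again
      }
      if (broken.contains(observer)) {
        unavailable.add(observer); // mark it, then move on to the next observer
        continue;
      }
      return observer;
    }
    return null;
  }

  boolean isMarkedUnavailable(String observer) {
    return unavailable.contains(observer);
  }

  public static void main(String[] args) {
    ObserverFallbackSketch sketch =
        new ObserverFallbackSketch(new HashSet<>(Arrays.asList("observer-1")));
    List<String> observers = Arrays.asList("observer-1", "observer-2");
    System.out.println(sketch.pickObserver(observers));
    System.out.println(sketch.isMarkedUnavailable("observer-1"));
  }
}
```

In this model the first request pays for the failed attempt on observer-1, and later requests skip it, which is the behavior the third point asks for.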

Detailed bug description: ~~HDFS-17166~~

case 1:

router's cache : [ observer-1(problematic), standby-2, standby-3(actually active) ]

client read -> observer-1 throws NoNamenodesAvailableException -> rotate the cache -> [ standby-2, standby-3(actually active), observer-1(problematic) ]

client retry read -> shuffleObserverNN -> [ observer-1(problematic), standby-2, standby-3(actually active) ] -> observer-1 throws NoNamenodesAvailableException -> rotate the cache -> [ standby-2, standby-3(actually active), observer-1(problematic) ]

.....

client (retries > max.attempts) -> read failed
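The loop in case 1 can be replayed with a toy model of the cache. This is only a sketch that follows the trace above, not the router implementation: `Case1RetryLoop`, `rotate`, and `observersFirst` are hypothetical names, and `observersFirst` merely imitates the effect that shuffleObserverNN has when there is a single observer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of case 1 (not Hadoop code). */
final class Case1RetryLoop {

  /** Moves the head of the cache to the tail ("rotate the cache"). */
  static void rotate(List<String> cache) {
    Collections.rotate(cache, -1);
  }

  /** Puts observers in front of all other namenodes, keeping relative order. */
  static List<String> observersFirst(List<String> cache) {
    List<String> ordered = new ArrayList<>();
    for (String nn : cache) {
      if (nn.startsWith("observer")) {
        ordered.add(nn);
      }
    }
    for (String nn : cache) {
      if (!nn.startsWith("observer")) {
        ordered.add(nn);
      }
    }
    return ordered;
  }

  /** Returns the namenode that each attempt is sent to first. */
  static List<String> simulate(int maxAttempts) {
    List<String> cache =
        new ArrayList<>(Arrays.asList("observer-1", "standby-2", "standby-3"));
    List<String> tried = new ArrayList<>();
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      cache = observersFirst(cache); // observer read: observers go to the front
      tried.add(cache.get(0));       // request goes to the head, which is broken
      rotate(cache);                 // failure -> rotate, undone by the next shuffle
    }
    return tried;
  }

  public static void main(String[] args) {
    // prints [observer-1, observer-1, observer-1]
    System.out.println(simulate(3));
  }
}
```

Every attempt lands on observer-1 because the reordering cancels the rotation, so the namenode that is actually active is never reached.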

case 2:

router's cache : [ observer-1(problematic), observer-2, standby-3, standby-4(actually active) ]

client read -> observer-1 throws NoNamenodesAvailableException -> rotate the cache -> [ observer-2, standby-3, standby-4(actually active), observer-1(problematic) ]

client retry read -> shuffleObserverNN -> [ observer-1(problematic), observer-2, standby-3, standby-4(actually active) ] (may happen) -> observer-1 throws NoNamenodesAvailableException -> rotate the cache -> [ observer-2, standby-3, standby-4(actually active), observer-1(problematic) ]

.....

client may reach (retries > max.attempts) -> read failed

case 3:

router's cache : [ standby-1, standby-2(actually active) ]

client request -> standby-1 throws NoNamenodesAvailableException -> rotate the cache -> [ standby-2(actually active), standby-1 ]

client retry request -> standby-2(actually active) success

client illegal request -> standby-2(actually active) throws NoNamenodesAvailableException -> rotate the cache -> [ standby-1, standby-2(actually active) ]

client legal request -> standby-1 throws NoNamenodesAvailableException -> failed
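Case 3 can be replayed the same way. The sketch below is a toy model that follows the trace, not router code: `Case3IllegalRequest` and the `rotateOnIllegal` flag are hypothetical and only contrast rotating the cache on an illegal request with leaving it alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of case 3 (not Hadoop code). */
final class Case3IllegalRequest {

  /** Returns the namenode that the next legal request would be sent to. */
  static String nextTarget(boolean rotateOnIllegal) {
    // standby-2 is the namenode that is actually active.
    List<String> cache = new ArrayList<>(Arrays.asList("standby-1", "standby-2"));

    // First request fails on standby-1, so the cache is rotated.
    Collections.rotate(cache, -1);
    // The retry succeeds on standby-2, which is now at the head.

    // An illegal request fails even on the active namenode.
    if (rotateOnIllegal) {
      Collections.rotate(cache, -1); // pushes the real standby back to the head
    }
    return cache.get(0);
  }

  public static void main(String[] args) {
    System.out.println(nextTarget(true));  // prints standby-1
    System.out.println(nextTarget(false)); // prints standby-2
  }
}
```

With rotation on illegal requests, the following legal request goes to the real standby and fails. Without it, the cache keeps the actually active namenode at the head.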

How to reproduce

I have provided unit tests: TestNoNamenodesAvailableLongTime

You can run my new unit tests against the original code to reproduce the problems above.

Attachments

HDFS-17232.001.patch
20/Oct/23 08:31
30 kB
Jian Zhang

Issue Links

links to

GitHub Pull Request #6208

People

Assignee: Jian Zhang
Reporter: Jian Zhang
Votes: 0
Watchers: 2

Dates

Created: 20/Oct/23 08:22
Updated: 28/Jan/24 01:21
Resolved: 03/Dec/23 16:04