[HDFS-16115] Asynchronous command handling in BPServiceActor may let BPServiceActor keep running even after CommandProcessingThread has exited with a fatal error - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 3.3.1
Fix Version/s: 3.3.1
Component/s: datanode
Labels:
None

Target Version/s:

3.3.1

Description

This is an improvement issue. It actually consists of two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously (CommandProcessingThread processes them). If an exception or error in CommandProcessingThread causes that thread to fail and stop, BPServiceActor is not aware of it and keeps putting commands from the NameNode into the queue to be handled by CommandProcessingThread, even though that thread is already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread died because of a non-fatal error such as "unable to create new native thread", which is caused by too many threads existing in the OS, the problem should be tolerated far more generously instead of simply shutting down the thread with no automatic recovery, because a non-fatal error like this will probably clear up by itself soon. For example:

2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when process queue BPServiceActor.java:1393
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:717)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)

Currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated.
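
The failure mode described above can be reproduced in miniature. The following is only a simplified model, not the Hadoop code: the class `DeadConsumerDemo`, the method `strandedCommands`, and the thread name are hypothetical, and the `OutOfMemoryError` is thrown by hand rather than caused by real thread exhaustion. It shows a producer that keeps enqueuing work after its consumer thread has exited, because nothing tells the producer that the consumer is gone.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class DeadConsumerDemo {
    /**
     * Starts a consumer, kills it with a simulated error, then keeps
     * enqueuing. Returns how many commands are left stranded in the queue.
     */
    static int strandedCommands(int commandsAfterFailure) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    queue.take().run();
                }
            } catch (Throwable t) {
                // The consumer logs and exits; the producer is never told.
                System.out.println("consumer exiting: " + t);
            }
        }, "command-processor-demo");
        consumer.setDaemon(true);
        consumer.start();

        // Simulated failure; no real thread exhaustion happens here.
        queue.add(() -> {
            throw new OutOfMemoryError("unable to create new native thread (simulated)");
        });

        try {
            consumer.join(); // wait until the consumer thread has died
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // The producer has no liveness check, so it keeps enqueuing.
        for (int i = 0; i < commandsAfterFailure; i++) {
            queue.add(() -> { });
        }
        return queue.size();
    }

    public static void main(String[] args) {
        System.out.println("stranded commands: " + strandedCommands(3));
    }
}
```

Every command added after the consumer exits stays in the queue, which mirrors how a BPServiceActor could keep queuing NameNode commands that are never processed.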

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the CommandProcessingThread thread; the retry count is 5 by default and configurable.

2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor thread is dead because of too many non-fatal errors, it should not simply be removed from the BPServiceActor list stored in BPOfferService; instead, the monitor thread periodically tries to restart these dead BPServiceActor threads. The interval is also configurable.
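
The two proposed changes could look roughly like the sketch below. It is not taken from 0001-HDFS-16115.patch: every name in it (`RetryAndMonitorSketch`, `runWithRetries`, `isNonFatal`, `RestartableActor`, `restartDeadActors`, `startMonitor`) is hypothetical, the retry limit is a plain constant rather than a real configuration key, and the rule for what counts as "non-fatal" is an assumption made for illustration. It also assumes that "5 retries" means five attempts after the initial one.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class RetryAndMonitorSketch {
    static final int DEFAULT_MAX_RETRIES = 5; // default named in the issue

    /** Assumption: only thread exhaustion counts as non-fatal here. */
    static boolean isNonFatal(Throwable t) {
        return t instanceof OutOfMemoryError
            && String.valueOf(t.getMessage()).contains("native thread");
    }

    /**
     * Runs the task, retrying after non-fatal failures. Returns true on
     * success and false once the retries are used up. Fatal errors propagate.
     */
    static boolean runWithRetries(Runnable task, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                task.run();
                return true;
            } catch (RuntimeException | Error t) {
                if (!isNonFatal(t)) {
                    throw t;
                }
            }
        }
        return false;
    }

    /** Hypothetical stand-in for an actor whose thread can be started again. */
    static final class RestartableActor {
        private final Runnable body;
        private Thread thread;

        RestartableActor(Runnable body) {
            this.body = body;
        }

        synchronized void start() {
            thread = new Thread(body, "actor-demo");
            thread.setDaemon(true);
            thread.start();
        }

        synchronized boolean isAlive() {
            return thread != null && thread.isAlive();
        }

        private synchronized Thread current() {
            return thread;
        }

        /** Blocks until the most recently started thread has exited. */
        void awaitExit() {
            Thread t = current();
            if (t == null) {
                return;
            }
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    /** One monitor pass: restarts every dead actor and returns the count. */
    static int restartDeadActors(List<RestartableActor> actors) {
        int restarted = 0;
        for (RestartableActor actor : actors) {
            if (!actor.isAlive()) {
                actor.start();
                restarted++;
            }
        }
        return restarted;
    }

    /** Schedules a monitor pass at a fixed delay on a daemon thread. */
    static ScheduledExecutorService startMonitor(
            List<RestartableActor> actors, long intervalMillis) {
        ScheduledExecutorService monitor =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "actor-monitor-demo");
                t.setDaemon(true);
                return t;
            });
        monitor.scheduleWithFixedDelay(() -> restartDeadActors(actors),
            intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        return monitor;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        boolean ok = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new OutOfMemoryError("unable to create new native thread (simulated)");
            }
        }, DEFAULT_MAX_RETRIES);
        System.out.println("succeeded=" + ok + " after " + calls[0] + " calls");
    }
}
```

A real implementation would probably also pause between retries so that the OS has time to free threads; the sketch retries immediately to stay short.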

Attachments

0001-HDFS-16115.patch
06/Jul/21 09:27
14 kB
Daniel Ma

People

Assignee: Daniel Ma

Reporter: Daniel Ma

Votes: 0

Watchers: 4

Dates

Created: 06/Jul/21 09:25

Updated: 22/Dec/22 08:11