Details
- Type: Bug
- Status: Patch Available
- Priority: Major
- Resolution: Unresolved
Description
The cluster is in HA mode and each DN has only one block pool.
The issue is that, after a failover, one DN is missing from the current active NN.
Upon analysis I found the following exception in BPOfferService.run():
2015-08-21 09:02:11,190 | WARN | DataNode: [[[DISK]file:/srv/BigData/hadoop/data5/dn/ [DISK]file:/srv/BigData/hadoop/data4/dn/]] heartbeating to 160-149-0-114/160.149.0.114:25000 | Unexpected exception in block pool Block pool BP-284203724-160.149.0.114-1438774011693 (Datanode Uuid 15ce1dd7-227f-4fd2-9682-091aa6bc2b89) service to 160-149-0-114/160.149.0.114:25000 | BPServiceActor.java:830
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:172)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:221)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:1887)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:669)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:616)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:856)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:671)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:822)
    at java.lang.Thread.run(Thread.java:745)
After this, the BPOfferService for this block pool stays down for the rest of the DN's runtime,
and the corresponding NN no longer has any details of this DN.
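As the stack trace shows, the OutOfMemoryError is raised while the async-delete thread pool tries to start a new worker thread, and it propagates to the thread that submitted the task, i.e. the heartbeat/command-processing thread itself. The following is a minimal standalone sketch (not Hadoop code) that simulates this with a ThreadFactory which throws the same error, standing in for Thread.start() failing with "unable to create new native thread":

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class OomPropagationDemo {
    public static void main(String[] args) {
        // Thread factory that simulates the JVM being unable to create a native thread.
        ThreadFactory failingFactory = r -> {
            throw new OutOfMemoryError("unable to create new native thread");
        };
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(), failingFactory);
        try {
            // The Error surfaces on the submitting thread, not inside the pool,
            // so a caller like the heartbeat/command-processing thread sees it.
            pool.execute(() -> System.out.println("never runs"));
        } catch (OutOfMemoryError e) {
            System.out.println("execute() propagated: " + e.getMessage());
        } finally {
            pool.shutdown();
        }
    }
}

This is why the error surfaces in BPServiceActor.run() rather than being confined to the FsDatasetAsyncDiskService pool.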
Similar issues are discussed in the following JIRAs:
https://issues.apache.org/jira/browse/HDFS-2882
https://issues.apache.org/jira/browse/HDFS-7714
Can we retry in this case as well, with a larger interval, instead of shutting down this BPOfferService?
Since these exceptions can occur randomly in the DN, I think it is not good to keep the DN running while some NN does not have its info.
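A minimal sketch of the proposed behavior, assuming a simplified service loop with made-up names and intervals (RetryingServiceLoopSketch, NORMAL_INTERVAL_MS, OOM_BACKOFF_MS); this is not the actual BPServiceActor implementation, only an illustration of retrying with a larger interval instead of terminating the loop:

import java.util.concurrent.TimeUnit;

public class RetryingServiceLoopSketch {
    private static final long NORMAL_INTERVAL_MS = 3_000;  // assumed normal heartbeat interval
    private static final long OOM_BACKOFF_MS = 60_000;     // assumed longer back-off after an OOM

    private volatile boolean shouldRun = true;

    /** Stand-in for one heartbeat / command-processing cycle. */
    private void offerServiceOnce() throws Exception {
        // ... send heartbeat, process commands; may throw OutOfMemoryError
        // from code such as an async-delete executor failing to spawn a thread.
    }

    public void run() {
        while (shouldRun) {
            try {
                offerServiceOnce();
                TimeUnit.MILLISECONDS.sleep(NORMAL_INTERVAL_MS);
            } catch (OutOfMemoryError oom) {
                // Proposal: do not terminate the loop; log and retry after a
                // larger interval, on the assumption that the thread shortage
                // may be transient.
                System.err.println("OOM in service loop, retrying after back-off: " + oom);
                sleepQuietly(OOM_BACKOFF_MS);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                shouldRun = false;
            } catch (Exception e) {
                System.err.println("Unexpected exception, retrying: " + e);
                sleepQuietly(NORMAL_INTERVAL_MS);
            }
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            TimeUnit.MILLISECONDS.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }
}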
Attachments
Issue Links
- is duplicated by HDFS-9684: DataNode stopped sending heartbeat after getting OutOfMemoryError form DataTransfer thread. (Resolved)