When the DirectoryScanner has computed the differences between on-disk blocks and the in-memory block map, it calls checkAndUpdate to reconcile them. However, FsDatasetImpl.checkAndUpdate is a synchronized call.
With about 6 million blocks per datanode, each 6-hour scan finds roughly 25,000 inconsistent blocks to fix, which leads to a long hold on the FsDatasetImpl lock.
Assuming each block takes about 10 ms to fix (because of SAS disk latency), fixing all of them takes 250 seconds. That means all reads and writes on that datanode are blocked for over 4 minutes.
Commands from the NameNode also take a long time to process because the handling threads are blocked, and the NameNode sees a long lastContact time for this datanode.
This likely affects all HDFS versions.
how to fix:
Just as invalidate commands from the NameNode are processed with a batch size of 1000, these abnormal blocks should be fixed in batches too, sleeping 2 seconds between batches to allow normal block reads and writes to proceed.
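A minimal sketch of the batching idea (the names processBatched and fixBlock are hypothetical stand-ins, not actual HDFS APIs): partition the scan differences into batches, hold the dataset lock only for one batch at a time, and sleep between batches so waiting readers and writers can acquire the lock in the gaps.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedFixSketch {
    static final int BATCH_SIZE = 1000;   // same batch size used for invalidate commands

    // Hypothetical stand-in for the per-block reconciliation work
    // done inside FsDatasetImpl.checkAndUpdate.
    static void fixBlock(long blockId) {
        // ... reconcile on-disk state with the in-memory replica map ...
    }

    // Process the scan differences in batches; the lock is held per
    // batch rather than for the entire 25,000-block fix-up.
    static void processBatched(List<Long> diffBlockIds, Object datasetLock,
                               long sleepMsBetweenBatches) throws InterruptedException {
        for (int start = 0; start < diffBlockIds.size(); start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, diffBlockIds.size());
            synchronized (datasetLock) {  // lock released between batches
                for (long id : diffBlockIds.subList(start, end)) {
                    fixBlock(id);
                }
            }
            if (end < diffBlockIds.size()) {
                Thread.sleep(sleepMsBetweenBatches);  // window for reads/writes
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i < 2500; i++) {
            ids.add(i);
        }
        // The proposal uses a 2-second sleep; a short sleep is used here
        // only to keep the demo fast.
        processBatched(ids, new Object(), 10);
        System.out.println("processed " + ids.size() + " blocks in batches of " + BATCH_SIZE);
    }
}
```

With a 2-second sleep, 25,000 blocks split into 25 batches add about 48 seconds of sleep overall, but no single lock hold lasts longer than roughly 10 seconds (1000 blocks at 10 ms each), instead of one 250-second hold.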