[HDFS-8791] block ID-based DN storage layout can be very slow for datanode on ext4 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.6.0, 2.8.0, 2.7.1
Fix Version/s: 2.8.0, 3.0.0-alpha1
Component/s: datanode
Labels:
None

Target Version/s:

2.8.0
Hadoop Flags:

Reviewed
Release Note:

Hide
~~HDFS-8791~~ introduces a new datanode layout format. This layout is identical to the previous block id based layout except it has a smaller 32x32 sub-directory structure in each data storage. On startup, the datanode will automatically upgrade it's storages to this new layout. Currently, datanode layout changes support rolling upgrades, on the other hand downgrading is not supported between datanode layout changes and a rollback would be required.

Show
HDFS-8791 introduces a new datanode layout format. This layout is identical to the previous block id based layout except it has a smaller 32x32 sub-directory structure in each data storage. On startup, the datanode will automatically upgrade it's storages to this new layout. Currently, datanode layout changes support rolling upgrades, on the other hand downgrading is not supported between datanode layout changes and a rollback would be required.

Description

underlined textWe are seeing cases where the new directory layout causes the datanode to basically cause the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout.

The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed.

So, what we have on disk is:

256 inodes for the first level directories
256 directory blocks for the first level directories
256*256 inodes for the second level directories
256*256 directory blocks for the second level directories
Then the inodes and blocks to store the the HDFS blocks themselves.

The main problem is the 256*256 directory blocks.

inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general).

Also, ext4 tries hard to spread directories evenly across the entire volume, this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far.

In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20 minutes.

How do things get cold?

A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next DU to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster.

Why didn't the previous layout see this?

It might have but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few a hundred seeks which would mean single digit seconds.
With only a few hundred directories, the odds of the directory blocks getting modified is quite high, this keeps those blocks hot and much less likely to be evicted.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

HDFS-8791-trunk-v1.patch
18/Nov/15 04:39
3 kB
Chris Trezzo
32x32DatanodeLayoutTesting-v1.pdf
18/Nov/15 04:39
124 kB
Chris Trezzo
32x32DatanodeLayoutTesting-v2.pdf
26/Nov/15 01:35
139 kB
Chris Trezzo
HDFS-8791-trunk-v2.patch
01/Dec/15 05:23
8 kB
Chris Trezzo
hadoop-56-layout-datanode-dir.tgz
01/Dec/15 05:23
194 kB
Chris Trezzo
HDFS-8791-trunk-v2.patch
01/Dec/15 14:20
8 kB
Kihwal Lee
HDFS-8791-trunk-v2-bin.patch
01/Dec/15 15:00
203 kB
Kihwal Lee
test-node-upgrade.txt
03/Dec/15 06:39
37 kB
Chris Trezzo
HDFS-8791-trunk-v3-bin.patch
29/Feb/16 22:15
215 kB
Chris Trezzo

Issue Links

is blocked by

HDFS-8578 On upgrade, Datanode should process all storage/data dirs in parallel

Closed

is broken by

HDFS-6482 Use block ID-based block layout on datanodes

Closed

is related to

HDFS-8782 Upgrade to block ID-based DN storage layout delays DN registration

Resolved

HDFS-8873 Allow the directoryScanner to be rate-limited

Resolved

relates to

HDFS-8858 DU should be re-executed if the target directory exists

Open

HADOOP-10434 Is it possible to use "df" to calculate the dfs usage instead of "du"

Resolved

(1 relates to)

block ID-based DN storage layout can be very slow for datanode on ext4

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates