[HDFS-11179] LightWeightHashSet can't remove blocks correctly which have a large number blockId - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Invalid
Affects Version/s: 3.0.0-alpha1
Fix Version/s: None
Component/s: namenode
Labels: None

Description

Our test cluster hit a problem where postponedMisreplicatedBlocksCount drops below zero. The cluster runs a recent 3.0 build, and we haven't created any EC files yet. This is the NN's log:

Rescan of postponedMisreplicatedBlocks completed in 13 msecs. 448 blocks are left. 176 blocks are removed.
Rescan of postponedMisreplicatedBlocks completed in 13 msecs. 272 blocks are left. 176 blocks are removed.
Rescan of postponedMisreplicatedBlocks completed in 14 msecs. 96 blocks are left. 176 blocks are removed.
Rescan of postponedMisreplicatedBlocks completed in 327 msecs. -77 blocks are left. 177 blocks are removed.
Rescan of postponedMisreplicatedBlocks completed in 15 msecs. -253 blocks are left. 179 blocks are removed.
Rescan of postponedMisreplicatedBlocks completed in 14 msecs. -432 blocks are left. 179 blocks are removed.

I looked into this issue and found that it is caused by LightWeightHashSet, which postponedMisreplicatedBlocks started using recently. When LightWeightHashSet removes blocks that have a large blockId, an overflow happens and the blocks can't be removed correctly (let alone EC blocks, whose blockIds start at the minimum value of long).
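
To make the suspected hazard concrete, here is a minimal, self-contained sketch of how a very large or very small long id can turn into a negative int hash, and how the bucket index then depends on the way it is derived. This is not the LightWeightHashSet code: the class name BlockIdHashSketch, the method names, and the capacity of 16 are all hypothetical, and the fold is simply the one java.lang.Long.hashCode uses. It only illustrates the general pitfall, and it does not establish that LightWeightHashSet is affected.

```java
// Hypothetical illustration only; not taken from the Hadoop source tree.
class BlockIdHashSketch {

    // Fold a 64-bit id into 32 bits, the same way java.lang.Long.hashCode does.
    public static int foldToInt(long blockId) {
        return (int) (blockId ^ (blockId >>> 32));
    }

    // Bucket index via remainder: negative whenever the hash is negative
    // and not a multiple of capacity.
    public static int indexByRemainder(int hash, int capacity) {
        return hash % capacity;
    }

    // Bucket index via bit mask: always in [0, capacity) when capacity
    // is a power of two, regardless of the sign of the hash.
    public static int indexByMask(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        long blockId = Long.MIN_VALUE + 1; // assumed id near the minimum of long
        int capacity = 16;                 // assumed power-of-two table size
        int hash = foldToInt(blockId);

        System.out.println("hash            = " + hash);
        System.out.println("remainder index = " + indexByRemainder(hash, capacity));
        System.out.println("mask index      = " + indexByMask(hash, capacity));
    }
}
```

With these assumed inputs the folded hash is -2147483647, the remainder-based index is -15, and the mask-based index is 1. A set that derives its bucket index with a mask is therefore not exposed to a negative index from a negative hash, so whether the removal path really overflows has to be checked against the actual index computation.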

Issue Links

relates to: HDFS-8792 BlockManager#postponedMisreplicatedBlocks should use a LightWeightHashSet to save memory (Resolved)

People

Assignee: Takanobu Asanuma

Reporter: Takanobu Asanuma

Votes: 0

Watchers: 9

Dates

Created: 28/Nov/16 03:50

Updated: 25/Oct/19 20:25

Resolved: 06/Dec/16 09:02