Details
- Bug
- Status: Resolved
- Major
- Resolution: Invalid
- 3.0.0-alpha1
- None
- None
Description
Our test cluster hit a problem in which postponedMisreplicatedBlocksCount drops below zero. The cluster runs a recent 3.0 build, and we have not created any EC files yet. This is the NN's log:
Rescan of postponedMisreplicatedBlocks completed in 13 msecs. 448 blocks are left. 176 blocks are removed.
Rescan of postponedMisreplicatedBlocks completed in 13 msecs. 272 blocks are left. 176 blocks are removed.
Rescan of postponedMisreplicatedBlocks completed in 14 msecs. 96 blocks are left. 176 blocks are removed.
Rescan of postponedMisreplicatedBlocks completed in 327 msecs. -77 blocks are left. 177 blocks are removed.
Rescan of postponedMisreplicatedBlocks completed in 15 msecs. -253 blocks are left. 179 blocks are removed.
Rescan of postponedMisreplicatedBlocks completed in 14 msecs. -432 blocks are left. 179 blocks are removed.
I looked into this and found that it is caused by LightWeightHashSet, which postponedMisreplicatedBlocks was recently switched to. When LightWeightHashSet removes blocks that have a large blockId, an overflow occurs and the blocks are not removed correctly (let alone EC blocks, whose blockIds start at the minimum value of long).
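The general hazard described here can be shown in isolation. The following is a minimal sketch and is not taken from LightWeightHashSet or BlockManager: the class and the helpers naiveIndex, maskedIndex and naiveCompare are hypothetical names made up for illustration. It only demonstrates how arithmetic on very large or very negative long IDs (such as IDs near Long.MIN_VALUE) can overflow or go negative in Java when a long is mapped to a bucket index or compared by subtraction.

```java
final class BlockIdOverflowSketch {

    // Hypothetical naive bucket index: in Java the remainder takes the sign
    // of the dividend, so a negative blockId yields a negative index.
    static int naiveIndex(long blockId, int capacity) {
        return (int) (blockId % capacity);
    }

    // Safer variant (assumes capacity is a power of two): fold the long into
    // an int hash, then mask, which always gives a value in [0, capacity).
    static int maskedIndex(long blockId, int capacity) {
        return Long.hashCode(blockId) & (capacity - 1);
    }

    // Hypothetical subtraction-based comparison: the difference of two longs
    // can overflow, which flips the sign of the result.
    static int naiveCompare(long a, long b) {
        return (int) (a - b);
    }

    public static void main(String[] args) {
        // An ID close to the minimum of long, as an assumed stand-in for an
        // ID drawn from a range that starts at Long.MIN_VALUE.
        long lowId = Long.MIN_VALUE + 5;

        System.out.println("naiveIndex   = " + naiveIndex(lowId, 16));
        System.out.println("maskedIndex  = " + maskedIndex(lowId, 16));

        // Math.abs cannot rescue the naive version: abs(Long.MIN_VALUE)
        // overflows and stays negative.
        System.out.println("abs(MIN)     = " + Math.abs(Long.MIN_VALUE));

        // Subtraction overflows, so the naive comparison reports the wrong
        // order, while Long.compare does not.
        System.out.println("naiveCompare = " + naiveCompare(Long.MAX_VALUE, Long.MIN_VALUE));
        System.out.println("Long.compare = " + Long.compare(Long.MAX_VALUE, Long.MIN_VALUE));
    }
}
```

If a set's add and remove paths disagree on the bucket or ordering of such an ID, a remove can miss an element that is actually present. Whether that is what happens in this cluster is the claim made above, not something this sketch establishes.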
Issue Links
- relates to HDFS-8792 BlockManager#postponedMisreplicatedBlocks should use a LightWeightHashSet to save memory (Resolved)