[HDFS-8881] Erasure Coding: internal blocks got missed and got over-replicated at the same time - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: erasure-coding
Labels: None

Description

The replication check depends on BlockManager#countNodes(), but countNodes() has a limitation for striped block groups.

One missing internal block will be caught by the replication check and handled by ReplicationMonitor.
One over-replicated internal block will be caught by the replication check and handled by processOverReplicatedBlocks.
One missing internal block and two over-replicated internal blocks at the same time will be caught by the replication check and handled first by processOverReplicatedBlocks, then by ReplicationMonitor.
One missing internal block and one over-replicated internal block at the same time will NOT be caught by the replication check.
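
The four cases above can be reproduced with a minimal sketch. It assumes the limitation is that the check compares only the total number of live internal blocks with the expected total and ignores block indices. That assumption is inferred from the cases listed here and is not taken from the BlockManager source. The class `StripedCountSketch` and its method `countOnlyCheck` are hypothetical names used only for illustration.

```java
public class StripedCountSketch {
    // Assumed 6+3 schema: 6 data blocks + 3 parity blocks per block group.
    static final int TOTAL = 6 + 3;

    // Hypothetical count-only check: looks at how many internal blocks are
    // live, not at which indices they carry.
    static String countOnlyCheck(int[] liveIndices) {
        if (liveIndices.length < TOTAL) {
            return "UNDER";
        }
        if (liveIndices.length > TOTAL) {
            return "OVER";
        }
        return "OK";
    }

    public static void main(String[] args) {
        // One missing internal block (#0): 8 live blocks.
        System.out.println(countOnlyCheck(new int[] {1, 2, 3, 4, 5, 6, 7, 8}));
        // One over-replicated internal block (#0 twice): 10 live blocks.
        System.out.println(countOnlyCheck(new int[] {0, 0, 1, 2, 3, 4, 5, 6, 7, 8}));
        // One missing (#0) and two over-replicated (#1, #2): 10 live blocks.
        System.out.println(countOnlyCheck(new int[] {1, 1, 2, 2, 3, 4, 5, 6, 7, 8}));
        // One missing (#0) and one over-replicated (#1): 9 live blocks,
        // so a count-only check sees nothing wrong.
        System.out.println(countOnlyCheck(new int[] {1, 1, 2, 3, 4, 5, 6, 7, 8}));
    }
}
```

Under this assumption the last case reports `OK`, because the missing block and the extra block cancel out in the total.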

"At the same time" means that one missing internal block cannot be recovered while another internal block has become over-replicated anyway. For example:

Scenario A:
1. Blocks #0 and #1 are reported missing.
2. A new #1 is recovered.
3. The old #1 comes back, and the recovery work for #0 fails.

Scenario B:
1. A DN that has #1 is decommissioned or dead.
2. Block #0 is reported missing.
3. The DN that has #1 is recommissioned, and the recovery work for #0 fails.

In the end, the block group has internal blocks [1, 1, 2, 3, 4, 5, 6, 7, 8], assuming a 6+3 schema. The client always needs to decode #0 unless the block group gets handled.
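
As an illustration of what an index-aware check would see for this block group, the sketch below tallies copies per internal block index. It is not the fix from the attached patch. The class `BlockGroupIndexSketch` and its methods are hypothetical, and the 6+3 schema is taken from the example above.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockGroupIndexSketch {
    // Count how many live copies exist for each internal block index.
    static int[] copiesPerIndex(int[] reported, int total) {
        int[] copies = new int[total];
        for (int idx : reported) {
            copies[idx]++;
        }
        return copies;
    }

    // Indices with no live copy.
    static List<Integer> missingIndices(int[] reported, int total) {
        int[] copies = copiesPerIndex(reported, total);
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            if (copies[i] == 0) {
                missing.add(i);
            }
        }
        return missing;
    }

    // Indices with more than one live copy.
    static List<Integer> duplicatedIndices(int[] reported, int total) {
        int[] copies = copiesPerIndex(reported, total);
        List<Integer> duplicated = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            if (copies[i] > 1) {
                duplicated.add(i);
            }
        }
        return duplicated;
    }

    public static void main(String[] args) {
        int[] group = {1, 1, 2, 3, 4, 5, 6, 7, 8}; // block group from the example
        int total = 6 + 3;                          // assumed 6+3 schema
        System.out.println("missing=" + missingIndices(group, total));       // missing=[0]
        System.out.println("duplicated=" + duplicatedIndices(group, total)); // duplicated=[1]
    }
}
```

The total is still nine internal blocks, yet index #0 has no copy and index #1 has two, which is the state the description says goes undetected.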

Attachments


HDFS-8881.00.patch
08/Aug/15 07:27
6 kB
Walter Su

Issue Links

duplicates

HDFS-14699 Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold

Resolved

People

Assignee: Walter Su

Reporter: Walter Su

Votes: 0

Watchers: 7

Dates

Created: 08/Aug/15 07:11

Updated: 03/Oct/19 00:03

Resolved: 03/Oct/19 00:03