[HDFS-15186] Erasure Coding: Decommission may generate the parity block's content with all 0 in some case - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.0.3, 3.2.1, 3.1.3
Fix Version/s: 3.3.0
Component/s: datanode, erasure-coding
Labels: None
Target Version/s: 3.3.0, 3.1.4, 3.2.2
Hadoop Flags: Reviewed
Flags: Patch, Important

Description

When I decommission more than one DataNode from a cluster, I find that some parity blocks end up with all-zero content. The probability is high (on the order of parts per thousand). This is a serious problem: if we read data from a zeroed parity block, or use it to recover another block, we consume corrupt data without knowing it.

Some example cases are listed below, using this notation:

B: busy DataNode,

D: decommissioning DataNode,

All others are normal.

1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].

2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].

....

In the first case, where the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)], the DN may receive a block reconstruction command with liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets array (a field of the class StripedReconstructionInfo) of length 2.
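The relationship between the group state and the command can be illustrated with a toy model. This is only a sketch inferred from the two cases listed above, not the NameNode's actual logic: it assumes a busy DataNode is left out of liveIndices and that every decommissioning DataNode contributes one target. The class name `ReconstructionCommandSketch` and its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (assumption): busy nodes are excluded from liveIndices,
// and each decommissioning node adds one reconstruction target.
class ReconstructionCommandSketch {
    final List<Integer> liveIndices = new ArrayList<>();
    int targetCount = 0;

    static ReconstructionCommandSketch build(boolean[] busy, boolean[] decommissioning) {
        ReconstructionCommandSketch cmd = new ReconstructionCommandSketch();
        for (int i = 0; i < busy.length; i++) {
            if (!busy[i]) {
                cmd.liveIndices.add(i);   // readable source candidate
            }
            if (decommissioning[i]) {
                cmd.targetCount++;        // block must be placed on a new node
            }
        }
        return cmd;
    }

    public static void main(String[] args) {
        boolean[] busy = new boolean[9];
        boolean[] decommissioning = new boolean[9];
        busy[6] = true;                 // index 6 is (B,D)
        decommissioning[6] = true;
        decommissioning[8] = true;      // index 8 is (D)
        ReconstructionCommandSketch cmd = build(busy, decommissioning);
        System.out.println(cmd.liveIndices + " targets=" + cmd.targetCount);
    }
}
```

Under these assumptions, case 1 produces eight live indices and two targets, which is the mismatch the rest of this description depends on.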

A targets length of 2 means that, in the current code, the DataNode must recover 2 internal blocks. But liveIndices shows only 1 missing block, so the method StripedWriter#initTargetIndices uses 0 as the default index for the second target, without checking whether index 0 is already among the source indices.
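The default-to-0 behaviour can be reproduced with a small standalone model. This is a simplified sketch, not the Hadoop source: the class `TargetIndicesSketch` and the method signature are hypothetical, and the only property it relies on is that a freshly allocated Java `short[]` is filled with zeros.

```java
import java.util.Arrays;
import java.util.BitSet;

// Simplified model of filling target indices from the missing positions.
// Slots that are never assigned keep Java's default value 0.
class TargetIndicesSketch {
    static short[] initTargetIndices(int[] liveIndices, int totalBlocks, int targetCount) {
        BitSet live = new BitSet(totalBlocks);
        for (int i : liveIndices) {
            live.set(i);
        }
        short[] targetIndices = new short[targetCount]; // all zeros initially
        int m = 0;
        for (int i = 0; i < totalBlocks && m < targetCount; i++) {
            if (!live.get(i)) {
                targetIndices[m++] = (short) i;         // a genuinely missing index
            }
        }
        return targetIndices;                           // unfilled slots stay 0
    }

    public static void main(String[] args) {
        int[] live = {0, 1, 2, 3, 4, 5, 7, 8};
        // Case 1: one missing index (6) but two targets requested.
        System.out.println(Arrays.toString(initTargetIndices(live, 9, 2))); // [6, 0]
    }
}
```

With the inputs of case 1 the result is [6, 0]: the second slot is a leftover default, not a block that is actually missing.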

When the EC algorithm is given source indices [0, 1, 2, 3, 4, 5] to recover indices [6, 0], index 0 appears in both the source indices and the target indices. In this case the target buffer that the EC algorithm returns for index 6 is always all zeros. At first I considered this a problem in the EC algorithm, which should be more fault tolerant, and I tried to fix it there. That proved too hard because there are too many cases to handle. The second case above is another example (source indices [1, 2, 3, 4, 5, 7] are used to recover indices [0, 6, 0]). So I changed my approach: invoke the EC algorithm with correct parameters, which means removing the duplicate target index 0 in this case. That is how I finally fixed it.
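The idea of passing only valid target indices to the decoder can be sketched as follows. This is an illustration of the approach described here, not the attached patch: the class `ValidTargetsSketch` and its method are hypothetical, and it simply truncates the target array to the number of indices that are actually absent from liveIndices.

```java
import java.util.Arrays;
import java.util.BitSet;

// Sketch: keep only target indices that are really missing, so no
// default-valued 0 entry reaches the EC decoder.
class ValidTargetsSketch {
    static short[] validTargetIndices(int[] liveIndices, int totalBlocks, int targetCount) {
        BitSet live = new BitSet(totalBlocks);
        for (int i : liveIndices) {
            live.set(i);
        }
        short[] buffer = new short[targetCount];
        int m = 0;
        for (int i = 0; i < totalBlocks && m < targetCount; i++) {
            if (!live.get(i)) {
                buffer[m++] = (short) i;
            }
        }
        // Drop the unfilled tail instead of leaving it as index 0.
        return Arrays.copyOf(buffer, m);
    }

    public static void main(String[] args) {
        // Case 1: targets length 2, one missing index.
        System.out.println(Arrays.toString(
            validTargetIndices(new int[]{0, 1, 2, 3, 4, 5, 7, 8}, 9, 2))); // [6]
        // Case 2: targets length 3, two missing indices.
        System.out.println(Arrays.toString(
            validTargetIndices(new int[]{1, 2, 3, 4, 5, 7, 8}, 9, 3)));    // [0, 6]
    }
}
```

In this model the decoder would be asked to recover [6] in case 1 and [0, 6] in case 2, so no index appears both as a source and as a target, and none appears twice.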

Attachments

- HDFS-15186.001.patch, 20/Feb/20 09:19, 4 kB, Yao Guangdong
- HDFS-15186.002.patch, 25/Feb/20 02:33, 5 kB, Yao Guangdong
- HDFS-15186.003.patch, 25/Feb/20 06:24, 5 kB, Yao Guangdong
- HDFS-15186.004.patch, 26/Feb/20 06:43, 6 kB, Yao Guangdong
- HDFS-15186.005.patch, 27/Feb/20 01:57, 6 kB, Yao Guangdong

Issue Links

relates to

HDFS-14768 EC : Busy DN replica should be consider in live replica check.

Resolved

People

Assignee: Yao Guangdong

Reporter: Yao Guangdong

Votes: 1

Watchers: 15

Dates

Created: 20/Feb/20 08:25

Updated: 04/Jan/22 04:32

Resolved: 27/Feb/20 19:04