[HDFS-4660] Block corruption can happen during pipeline recovery - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.0.3-alpha, 3.0.0-alpha1
Fix Version/s: 2.8.0, 2.7.1, 2.6.4, 3.0.0-alpha1
Component/s: datanode
Labels: None
Target Version/s: 2.7.1, 2.6.4
Hadoop Flags: Reviewed

Description

Initial pipeline: DN1 DN2 DN3
Stop DN2.

Pipeline recovery adds a new node, DN4, at the 2nd position:
DN1 DN4 DN3

recover RBW
DN4 after recover rbw
2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
getNumBytes() = 134144
getBytesOnDisk() = 134144
getVisibleLength()= 134144
end at chunk (134144/512=262)

DN3 after recover rbw
2013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
2013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
getNumBytes() = 134028
getBytesOnDisk() = 134028
getVisibleLength()= 134028

Client sends a packet after pipeline recovery:
offset=133632 len=1008

DN4 after flush
2013-04-01 21:02:31,779 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file offset:134640; meta offset:1063
// meta end position should be ceil(134640/512)*4 + 7 == 1059, but now it is 1063.

DN3 after flush
2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, lastPacketInBlock=false, offsetInBlock=134640, ackEnqueueNanoTime=8817026136871545)
2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing meta file offset of block BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from 1055 to 1051
2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file offset:134640; meta offset:1059
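
The meta offsets quoted in these logs can be cross-checked by hand. The sketch below is only an illustration: `MetaOffsetCheck` and its methods are hypothetical names, not Hadoop APIs, and the constants (512-byte chunks, 4-byte checksums, 7-byte meta header) are the ones implied by the arithmetic in the DN4 log comment.

```java
// Sketch only: MetaOffsetCheck is a hypothetical helper, not part of Hadoop.
final class MetaOffsetCheck {
    static final int BYTES_PER_CHUNK = 512; // assumed from the report's "/512"
    static final int CHECKSUM_SIZE = 4;     // assumed from the report's "*4"
    static final int HEADER_SIZE = 7;       // assumed from the report's "+ 7"

    /** Expected meta file end once dataLen bytes (including a partial last chunk) are checksummed. */
    static long expectedMetaEnd(long dataLen) {
        long chunks = (dataLen + BYTES_PER_CHUNK - 1) / BYTES_PER_CHUNK; // ceil(dataLen / 512)
        return HEADER_SIZE + chunks * CHECKSUM_SIZE;
    }

    /** Meta offset of the checksum slot for the chunk that contains dataOffset. */
    static long metaOffsetOfChunk(long dataOffset) {
        return HEADER_SIZE + (dataOffset / BYTES_PER_CHUNK) * CHECKSUM_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(expectedMetaEnd(134640));   // 1059: expected end after the packet
        System.out.println(expectedMetaEnd(134028));   // 1055: DN3 meta end before the packet
        System.out.println(metaOffsetOfChunk(133632)); // 1051: where DN3 rewinds to
        System.out.println(expectedMetaEnd(134144));   // 1055: DN4 meta end before the packet
    }
}
```

Under these assumptions DN3's sequence (1055, rewound to 1051, ending at 1059) is self-consistent, while DN4's 1063 is exactly one 4-byte checksum past the expected 1059.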

After checking the meta file on DN4, I found that the checksum of chunk 262 is duplicated, but the data is not.
Later, after the block was finalized, DN4's scanner detected the bad block and reported it to the NameNode (NN). The NN sent a command to delete this block and to re-replicate it from another DN in the pipeline to satisfy the replication factor.

I think this is because BlockReceiver skips the data bytes that are already written, but does not skip the checksum bytes that are already written. The function adjustCrcFilePosition is only used when the last chunk is incomplete, not for this situation.
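
The suspected cause can be illustrated with a small model of DN4's case. This is a simplified sketch, not the actual BlockReceiver code: `PacketSkipSketch` and its methods are hypothetical, the 512/4/7 constants are assumed from the logs above, and the model only covers the chunk-aligned case (the partial-chunk case handled by adjustCrcFilePosition is deliberately left out).

```java
// Simplified model of the reported behavior; PacketSkipSketch is a hypothetical
// name and this is not the actual BlockReceiver implementation.
final class PacketSkipSketch {
    static final int BYTES_PER_CHUNK = 512; // assumed chunk size
    static final int CHECKSUM_SIZE = 4;     // assumed checksum size
    static final int HEADER_SIZE = 7;       // assumed meta header size

    static long chunksCovering(long bytes) {
        return (bytes + BYTES_PER_CHUNK - 1) / BYTES_PER_CHUNK;
    }

    /** Data bytes at the start of the packet that are already on disk. */
    static long dataBytesToSkip(long onDiskLen, long packetOffset, long packetLen) {
        return Math.max(0, Math.min(onDiskLen - packetOffset, packetLen));
    }

    /** Checksum bytes belonging to the whole chunks that were skipped. */
    static long checksumBytesToSkip(long onDiskLen, long packetOffset, long packetLen) {
        return dataBytesToSkip(onDiskLen, packetOffset, packetLen) / BYTES_PER_CHUNK * CHECKSUM_SIZE;
    }

    /** Meta file end after appending the packet's checksums, with or without the checksum skip. */
    static long metaEndAfterPacket(long onDiskLen, long packetOffset, long packetLen,
                                   boolean skipChecksums) {
        if (onDiskLen % BYTES_PER_CHUNK != 0 || packetOffset % BYTES_PER_CHUNK != 0
                || packetOffset > onDiskLen) {
            throw new IllegalArgumentException("sketch only models chunk-aligned, overlapping packets");
        }
        long metaEndBefore = HEADER_SIZE + chunksCovering(onDiskLen) * CHECKSUM_SIZE;
        long checksumBytesInPacket = chunksCovering(packetLen) * CHECKSUM_SIZE;
        long skipped = skipChecksums
                ? checksumBytesToSkip(onDiskLen, packetOffset, packetLen) : 0;
        return metaEndBefore + checksumBytesInPacket - skipped;
    }

    public static void main(String[] args) {
        // DN4: 134144 bytes on disk, packet offset=133632 len=1008.
        System.out.println(metaEndAfterPacket(134144, 133632, 1008, false)); // 1063, as logged on DN4
        System.out.println(metaEndAfterPacket(134144, 133632, 1008, true));  // 1059, the expected value
    }
}
```

In this model the packet overlaps one full chunk (512 bytes) that DN4 already holds, so skipping the data without skipping the matching 4-byte checksum leaves one duplicated checksum in the meta file.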

Attachments

periodic_hflush.patch (25/Aug/16 15:06, 3 kB, Nathan Roberts)
HDFS-4660.v2.patch (18/May/15 21:25, 8 kB, Kihwal Lee)
HDFS-4660.patch (02/Apr/13 06:55, 2 kB, Peng Zhang)
HDFS-4660.patch (13/May/15 22:40, 8 kB, Kihwal Lee)
HDFS-4660.br26.patch (19/Jan/16 16:35, 8 kB, Kihwal Lee)

Issue Links

is duplicated by: HDFS-10587 Incorrect offset/length calculation in pipeline recovery causes block corruption (Resolved)
is related to: HDFS-16601 DataTransfer should throw IOException to Client (Open)
relates to: HDFS-10652 Add a unit test for HDFS-4660 (Resolved)
relates to: HDFS-9220 Reading small file (< 512 bytes) that is open for append fails due to incorrect checksum (Closed)

People

Assignee: Kihwal Lee
Reporter: Peng Zhang
Votes: 0
Watchers: 26

Dates

Created: 02/Apr/13 06:51
Updated: 26/Jul/22 21:15
Resolved: 16/Jun/15 20:46