Hadoop HDFS / HDFS-4660

Block corruption can happen during pipeline recovery

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.3-alpha, 3.0.0-alpha1
    • Fix Version/s: 2.8.0, 2.7.1, 2.6.4, 3.0.0-alpha1
    • Component/s: datanode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Pipeline: DN1 DN2 DN3
      Stop DN2.

      Pipeline recovery adds node DN4 at the 2nd position:
      DN1 DN4 DN3

      Recover RBW.
      DN4 after RBW recovery:
      2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
      2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
      getNumBytes() = 134144
      getBytesOnDisk() = 134144
      getVisibleLength()= 134144
      ends at chunk 262 (134144/512 = 262)

      DN3 after RBW recovery:
      2013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
      2013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
      getNumBytes() = 134028
      getBytesOnDisk() = 134028
      getVisibleLength()= 134028

      Client sends a packet after pipeline recovery:
      offset=133632 len=1008
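
      For context, a minimal sketch of the overlap arithmetic (class and variable names here are illustrative, not BlockReceiver's actual fields): DN4 recovered with 134144 bytes on disk, so the first 134144 - 133632 = 512 bytes of this packet, exactly one full 512-byte chunk, are a re-send.

      // Sketch only; names and layout are illustrative, not BlockReceiver's.
      public class PacketOverlap {
        static final int CHUNK_SIZE = 512;  // bytes per checksum chunk

        public static void main(String[] args) {
          long onDiskLen = 134144;     // bytes DN4 already has after RBW recovery
          long packetOffset = 133632;  // offsetInBlock of the resent packet
          int packetLen = 1008;

          long overlap = onDiskLen - packetOffset;              // 512 bytes already written
          long overlapChunks = overlap / CHUNK_SIZE;            // 1 full chunk to skip
          long newBytes = packetOffset + packetLen - onDiskLen; // 496 genuinely new bytes

          System.out.println(overlap + " / " + overlapChunks + " / " + newBytes);
          // DN4 must skip the 512 overlapping data bytes AND the matching
          // 4-byte checksum; it only skips the data, which is this bug.
        }
      }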

      DN4 after flush
      2013-04-01 21:02:31,779 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file offset:134640; meta offset:1063
      // meta end position should be ceil(134640/512)*4 + 7 == 263*4 + 7 == 1059, but now it is 1063 (one extra 4-byte checksum).
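
      For reference, a minimal sketch of that calculation, assuming CRC32's 4-byte checksums and the 7-byte BlockMetadataHeader (the helper name is hypothetical, not an HDFS API):

      // Hypothetical helper: expected checksum-file write position for a
      // given block offset; 7-byte header plus 4 bytes of CRC32 per
      // (possibly partial) 512-byte chunk.
      class MetaOffset {
        static long expectedMetaOffset(long blockOffset) {
          final int bytesPerChecksum = 512;
          final int checksumSize = 4;   // CRC32
          final int headerSize = 7;     // BlockMetadataHeader
          long chunks = (blockOffset + bytesPerChecksum - 1) / bytesPerChecksum; // ceil
          return headerSize + chunks * checksumSize;
        }
        public static void main(String[] args) {
          // 134640 bytes -> ceil(134640/512) = 263 chunks -> 7 + 263*4 = 1059.
          System.out.println(expectedMetaOffset(134640)); // 1059; DN4 logged 1063
        }
      }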

      DN3 after flush
      2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, lastPacketInBlock=false, offsetInBlock=134640, ackEnqueueNanoTime=8817026136871545)
      2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing meta file offset of block BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from 1055 to 1051
      2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file offset:134640; meta offset:1059

      After checking the meta file on DN4, I found that the checksum of chunk 262 is duplicated, but the data is not.
      Later, after the block was finalized, DN4's block scanner detected the bad block and reported it to the NN. The NN sent a command to delete this block and re-replicate it from another DN in the pipeline to satisfy the replication factor.

      I think this is because BlockReceiver skips data bytes that were already written, but does not skip the checksum bytes that were already written. And the function adjustCrcFilePosition is only used for the last non-completed chunk, not for this situation.
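
      A minimal sketch of the fix idea (not the actual patch; names, stream handling, and the chunk-aligned offsetInBlock are assumptions): when a packet overlaps data already on disk, the receiver must also position the meta stream past the checksums of the overlapping full chunks, instead of appending them again.

      import java.io.IOException;
      import java.io.RandomAccessFile;

      // Sketch only; assumes offsetInBlock is chunk-aligned, as after pipeline
      // recovery. A partial-chunk overlap would additionally need the checksum
      // recomputation that adjustCrcFilePosition handles.
      class ChecksumSkipSketch {
        static final int CHUNK = 512;   // bytesPerChecksum
        static final int CSUM = 4;      // CRC32 size
        static final int HEADER = 7;    // BlockMetadataHeader size

        static void writePacket(RandomAccessFile dataOut, RandomAccessFile metaOut,
                                long offsetInBlock, byte[] data, byte[] checksums,
                                long onDiskLen) throws IOException {
          int skipBytes = (int) Math.max(0, onDiskLen - offsetInBlock);
          int skipChunks = skipBytes / CHUNK; // full chunks whose checksums exist

          // Skip data bytes already on disk (BlockReceiver already does this).
          dataOut.seek(onDiskLen);
          dataOut.write(data, skipBytes, data.length - skipBytes);

          // The missing half: seek past the checksums of those chunks so the
          // resent checksums overwrite rather than duplicate them in meta.
          metaOut.seek(HEADER + (offsetInBlock / CHUNK + skipChunks) * (long) CSUM);
          metaOut.write(checksums, skipChunks * CSUM,
                        checksums.length - skipChunks * CSUM);
        }
      }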

        Attachments

        1. HDFS-4660.br26.patch
          8 kB
          Kihwal Lee
        2. HDFS-4660.patch
          8 kB
          Kihwal Lee
        3. HDFS-4660.patch
          2 kB
          Peng Zhang
        4. HDFS-4660.v2.patch
          8 kB
          Kihwal Lee
        5. periodic_hflush.patch
          3 kB
          Nathan Roberts

              People

              • Assignee: Kihwal Lee (kihwal)
              • Reporter: Peng Zhang (peng.zhang)
              • Votes: 0
              • Watchers: 27
