Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: datanode
    • Labels:
      None

      Description

We found that the Datanode doesn't do a graceful shutdown, so a block can be left corrupted (data and checksum lengths out of sync).

We can make the DN do a graceful shutdown when there are open files. If this gets in the way of a timely shutdown, we can add a parameter for how long to wait for a full graceful shutdown before simply exiting.
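The timed graceful shutdown proposed above can be sketched with a standard java.util.concurrent pattern. This is a hypothetical helper, not actual DataNode code: the name `shutdownGracefully` and the modeling of DataXceiver threads as an `ExecutorService` are assumptions for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GracefulShutdown {
    // Hypothetical helper: stop accepting new work, wait up to timeoutMs for
    // in-flight writers to finish, then interrupt whatever is still running.
    static boolean shutdownGracefully(ExecutorService xceivers, long timeoutMs) {
        xceivers.shutdown();                      // no new transfers accepted
        try {
            if (xceivers.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
                return true;                      // everything finished cleanly
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        xceivers.shutdownNow();                   // timeout: force remaining threads
        return false;
    }

    // Demo: run one simulated transfer of taskMillis, allow timeoutMs to drain.
    static boolean demo(long taskMillis, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.submit(() -> {
            try { Thread.sleep(taskMillis); } catch (InterruptedException e) { }
        });
        return shutdownGracefully(pool, timeoutMs);
    }
}
```

The return value lets the caller decide whether to log that the shutdown was forced, which is exactly the "parameter of how long to wait" behavior described above.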

        Activity

        sam rash added a comment -

        My understanding of how lease recovery works in 20-append is that on cluster restart, an open file is recovered by the Namenode. Datanodes send the longest valid length of the block (i.e., with 8 bytes of checksum and 1500 bytes of data, the valid length is 1024, assuming a 512-byte chunk size). The block is then truncated to that valid length.

        20-append seems to have a bug: any block whose data and checksum lengths don't match isn't used in lease recovery. The work here might be to fix that.
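The valid-length arithmetic in the comment above can be sketched as follows. This is a hypothetical helper, not HDFS code; the 4-byte checksum size is an assumption consistent with the 8-bytes-of-checksum, 1024-valid-bytes example.

```java
public class BlockLength {
    // Hypothetical helper mirroring the recovery arithmetic described above:
    // each checksumSize-byte checksum covers one bytesPerChecksum-byte chunk,
    // so only data covered by a complete on-disk checksum counts as valid.
    static long longestValidLength(long dataLen, long checksumBytes,
                                   int checksumSize, int bytesPerChecksum) {
        long completeChunks = checksumBytes / checksumSize;
        return Math.min(dataLen, completeChunks * (long) bytesPerChecksum);
    }
}
```

With the numbers from the comment (1500 data bytes, 8 checksum bytes, 4-byte checksums, 512-byte chunks), this yields min(1500, 2 * 512) = 1024.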

        dhruba borthakur added a comment -

        +1 for fixing 0.20-append: if data and checksum lengths don't match on a datanode, then truncate the appropriate file and make this replica participate in lease recovery.

        Todd Lipcon added a comment -

        I believe the way 20-append works is that it won't use that replica if there are other replicas where the checksum and data length do match up. If all of them are incorrect, it falls back to the old behavior of truncating. If we're too aggressive with the truncation, we risk truncating back before the last sync() offset, I think. HDFS-1218 changes the behavior here a bit, though - I think that's one of the patches that hasn't made it into the branch yet?

        sam rash added a comment -

        I saw a case where the single existing replica did not have matching data and checksum lengths. It was not used, and we lost the block. I need to double-check the code, but the DN exception was that the block was not valid and couldn't be used.

        It seems to me the logic is simple: take the longest valid length you can get. Whether data and checksum lengths match doesn't matter as far as I can tell (though I think a matching replica is typically longer than a mismatching one).

        Truncation only happens after the NN picks the length of the block. As I said, I think the bug, at least in our patched rev (I need to look at stock 20-append), is that replicas with mismatched lengths can't participate in lease recovery at all, which seems broken.

        sam rash added a comment -

        Actually, sorry: I remember your patch to do this. I think the rev I'm using internally is older; I will check 20-append as well. This problem may not appear in 20-append with all the latest patches.

        The question still remains whether there is any benefit in making the datanode do a clean shutdown of the DataXceiver threads.

        sam rash added a comment -

        Actually, it's in our latest branch here, which is >= 20-append and includes your patch. The problem is that getBlockMetaDataInfo() has this at the end:

        
            // paranoia! verify that the contents of the stored block
            // matches the block file on disk.
            data.validateBlockMetadata(stored);
        

        which includes this check:

        
            if (f.length() > maxDataSize || f.length() <= minDataSize) {
              throw new IOException("Block " + b +
                                    " is of size " + f.length() +
                                    " but has " + (numChunksInMeta + 1) +
                                    " checksums and each checksum size is " +
                                    checksumsize + " bytes.");
            }
        

        A block is not allowed to participate in lease recovery if this check fails.
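To make the failure concrete, the quoted bounds check can be sketched in isolation. The derivation of maxDataSize and minDataSize from the chunk count is an assumption consistent with the exception message; the real FSDataset computes them from the meta-file length.

```java
public class MetadataCheck {
    // Sketch of the quoted bounds check: with numChunks complete checksums on
    // disk, the data file must be longer than (numChunks - 1) full chunks and
    // no longer than numChunks full chunks; otherwise the data and checksum
    // lengths are out of sync and the check throws.
    static boolean lengthsMatch(long dataLen, long numChunks, int bytesPerChecksum) {
        long maxDataSize = numChunks * bytesPerChecksum;
        long minDataSize = (numChunks - 1) * bytesPerChecksum;
        return dataLen <= maxDataSize && dataLen > minDataSize;
    }
}
```

The replica from the earlier example (1500 data bytes but only 2 checksums) fails this check because 1500 > 2 * 512, which is why it was rejected from recovery instead of being truncated to 1024.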

        Thanh Do added a comment -

        Sam, can you explain what you mean by "graceful shutdown"
        and "clean shutdown of the DataXceiver threads"?

        If we don't do a clean shutdown, the client will
        trigger a pipeline recovery and exclude that datanode;
        isn't that still fine?

        Thanks

        Allen Wittenauer added a comment -

        Based on the context, this appears to have been a 0.20-append issue, and append was overhauled in 2.x. Closing as Won't Fix.


          People

          • Assignee:
            sam rash
            Reporter:
            sam rash
          • Votes:
            0
            Watchers:
            8

            Dates

            • Created:
              Updated:
              Resolved:

              Development