Hadoop HDFS / HDFS-7809

Block and lease recovery failure caused by snapshot issue

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.5.0
    • None
    • None
    • None

    Description

      On a cluster running 2.5, we observed a decommissioning failure caused by a file that had been under construction for 3 days. It turned out that the file had been abandoned and that the name node had started lease recovery for it 3 days earlier.

      The block recovery failed because the name node threw a quota exception while serving commitBlockSynchronization(). After this failure, no further recovery attempt was made, leaving the file in the under-construction state indefinitely.

      Furthermore, the nature of the recovery failure is very strange. Even though snapshots were never used in the cluster, the name node tried to record a snapshot diff, which required incrementing the namespace quota usage (nsquota) by 1. The user happened to have run out of nsquota at that time, so the update failed and caused commitBlockSynchronization() to fail. We do see quota discrepancies occasionally; they may have been caused by something like this all along.

      A few observations:

      • Lease recovery did not complete, yet it was not retried.
      • No snapshot was in use, but the operation somehow went through a snapshot-related code path.
      • The quota update during commitBlockSynchronization() should be done unconditionally.
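
      The first and third observations can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HDFS code: the class and method names (QuotaRecoverySketch, Dir, FileUnderRecovery, commitChecked, commitUnchecked) are hypothetical, and the model reduces the commit step to a single namespace-usage update. It shows how a quota-checked update at commit time can leave a file stuck under construction when nothing retries, and how an unconditional update lets the commit finish while the usage temporarily exceeds the quota.

```java
public class QuotaRecoverySketch {

    /** Hypothetical stand-in for a quota violation; not the HDFS exception class. */
    static class QuotaExceeded extends Exception {
        QuotaExceeded(String msg) { super(msg); }
    }

    /** Toy directory that tracks a namespace quota and the current usage. */
    static class Dir {
        final long nsQuota;
        long nsUsed;

        Dir(long nsQuota, long nsUsed) {
            this.nsQuota = nsQuota;
            this.nsUsed = nsUsed;
        }

        /** Checked update: rejects the change if it would exceed the quota. */
        void addUsageChecked(long delta) throws QuotaExceeded {
            if (nsUsed + delta > nsQuota) {
                throw new QuotaExceeded("namespace quota exceeded: " + (nsUsed + delta) + " > " + nsQuota);
            }
            nsUsed += delta;
        }

        /** Unconditional update: records the usage even if it passes the quota. */
        void addUsageUnchecked(long delta) {
            nsUsed += delta;
        }
    }

    /** Toy file whose only state is whether it is still under construction. */
    static class FileUnderRecovery {
        boolean underConstruction = true;
    }

    /** Models the reported behavior: the quota check aborts the commit and nothing retries. */
    static boolean commitChecked(Dir dir, FileUnderRecovery file, long nsDelta) {
        try {
            dir.addUsageChecked(nsDelta);
            file.underConstruction = false;
            return true;
        } catch (QuotaExceeded e) {
            // No retry is scheduled in this model, mirroring the observation in the report.
            return false;
        }
    }

    /** Models the suggested behavior: the usage is recorded unconditionally and the commit finishes. */
    static boolean commitUnchecked(Dir dir, FileUnderRecovery file, long nsDelta) {
        dir.addUsageUnchecked(nsDelta);
        file.underConstruction = false;
        return true;
    }

    public static void main(String[] args) {
        Dir dir = new Dir(10, 10); // the user is already at the namespace quota

        FileUnderRecovery a = new FileUnderRecovery();
        System.out.println("checked commit succeeded: " + commitChecked(dir, a, 1));
        System.out.println("file still under construction: " + a.underConstruction);

        FileUnderRecovery b = new FileUnderRecovery();
        System.out.println("unchecked commit succeeded: " + commitUnchecked(dir, b, 1));
        System.out.println("file still under construction: " + b.underConstruction);
    }
}
```

      In this model the unconditional variant trades strict quota enforcement for a guaranteed commit; whether that trade-off matches the eventual fix is not stated in this report.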

            People

              Assignee: Unassigned
              Reporter: Kihwal Lee
              Votes: 0
              Watchers: 7
