  1. Hadoop HDFS
  2. HDFS-10220

A large number of expired leases can make namenode unresponsive and cause failover

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Two new configuration properties, "dfs.namenode.lease-recheck-interval-ms" and "dfs.namenode.max-lock-hold-to-release-lease-ms", have been added to fine-tune the duty cycle with which the NameNode recovers old leases.
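
To make the "duty cycle" in the release note concrete, here is a minimal arithmetic sketch. It assumes the lease monitor alternates between one bounded lock hold and one sleep of the recheck interval. The class name, method name, and the values 2000 ms and 25 ms are assumptions chosen for illustration; check hdfs-default.xml for the defaults in your release.

```java
public class LeaseDutyCycle {
    /**
     * Rough upper bound on the fraction of time the lease monitor holds the lock,
     * assuming each cycle is one lock hold of at most maxLockHoldMs followed by
     * one sleep of recheckIntervalMs. A single slow lease release can overshoot
     * the hold limit, so treat the result as an approximation.
     */
    public static double worstCaseLockFraction(long recheckIntervalMs, long maxLockHoldMs) {
        if (recheckIntervalMs < 0 || maxLockHoldMs < 0
                || recheckIntervalMs + maxLockHoldMs == 0) {
            throw new IllegalArgumentException(
                "intervals must be non-negative and not both zero");
        }
        return (double) maxLockHoldMs / (double) (recheckIntervalMs + maxLockHoldMs);
    }

    public static void main(String[] args) {
        long recheckIntervalMs = 2000L; // assumed dfs.namenode.lease-recheck-interval-ms
        long maxLockHoldMs = 25L;       // assumed dfs.namenode.max-lock-hold-to-release-lease-ms
        double fraction = worstCaseLockFraction(recheckIntervalMs, maxLockHoldMs);
        System.out.println("Approximate lock-hold fraction: " + fraction);
    }
}
```

Under this model, raising the maximum lock hold or lowering the recheck interval lets the NameNode clear a backlog of expired leases faster, at the cost of holding the lock for a larger share of the time.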

      Description

      I experienced a NameNode failover, triggered when the ZKFC detected an unresponsive NameNode, accompanied by a very large number of WARN messages (about 5 million) like this one:
      org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file closed.

      In the thread dump taken by the ZKFC, many threads are blocked waiting on a lock.

      Looking at the code, LeaseManager.Monitor takes a lock whenever leases must be released. Because of the very large number of leases to release, the NameNode held that lock for too long, blocking all other tasks and leading the ZKFC to conclude that the NameNode was unavailable or stuck.

      The idea of this patch is to limit the number of leases released on each lease check, so that the lock is not held for too long.
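
The bounded-release idea can be sketched as follows. This is not the patch itself: it is a simplified, hypothetical model in which a queue of strings stands in for expired leases, and the bound is a time budget (in the spirit of the "max-lock-hold" property named in the release note) rather than a fixed count. The class `BoundedLeaseRelease`, its methods, and the injected clock are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.LongSupplier;

public class BoundedLeaseRelease {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Queue<String> expiredLeases = new ArrayDeque<>();
    private final long maxLockHoldMs;
    private final LongSupplier clockMs; // injected so the sketch can be tested

    public BoundedLeaseRelease(long maxLockHoldMs, LongSupplier clockMs) {
        this.maxLockHoldMs = maxLockHoldMs;
        this.clockMs = clockMs;
    }

    public void addExpired(String holder) {
        lock.writeLock().lock();
        try {
            expiredLeases.add(holder);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public int pending() {
        lock.readLock().lock();
        try {
            return expiredLeases.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    /**
     * Releases expired leases until the queue is empty or the time budget is
     * spent, then gives up the lock so other threads can make progress.
     * Returns the number of leases released in this pass.
     */
    public int checkLeases() {
        lock.writeLock().lock();
        try {
            long start = clockMs.getAsLong();
            int released = 0;
            while (!expiredLeases.isEmpty()) {
                expiredLeases.poll(); // stand-in for the real lease release work
                released++;
                if (clockMs.getAsLong() - start > maxLockHoldMs) {
                    break; // budget exceeded: leave the rest for the next check
                }
            }
            return released;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        BoundedLeaseRelease monitor =
            new BoundedLeaseRelease(25L, System::currentTimeMillis);
        for (int i = 0; i < 1000; i++) {
            monitor.addExpired("client-" + i);
        }
        int passes = 0;
        while (monitor.pending() > 0) {
            monitor.checkLeases();
            passes++;
        }
        System.out.println("Backlog cleared in " + passes + " pass(es)");
    }
}
```

The design point is that a large backlog is drained over several short passes instead of one long one, so threads waiting on the lock are never starved for much longer than the budget.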

        Attachments

        1. HADOOP-10220.007.patch
          13 kB
          Nicolas Fraison
        2. HADOOP-10220.006.patch
          8 kB
          Nicolas Fraison
        3. HADOOP-10220.005.patch
          8 kB
          Nicolas Fraison
        4. HADOOP-10220.004.patch
          11 kB
          Nicolas Fraison
        5. HADOOP-10220.003.patch
          11 kB
          Nicolas Fraison
        6. HADOOP-10220.002.patch
          10 kB
          Nicolas Fraison
        7. threaddump_zkfc.txt
          809 kB
          Nicolas Fraison
        8. HADOOP-10220.001.patch
          9 kB
          Nicolas Fraison

              People

              • Assignee:
                nfraison.criteo Nicolas Fraison
              • Reporter:
                nfraison.criteo Nicolas Fraison
              • Votes:
                0
              • Watchers:
                17