Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 0.23.4, 2.0.2-alpha
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

      Description

       When a large number of files are abandoned without being closed, a storm of lease expirations follows in about an hour (the lease hard limit). For the last block of each file, block recovery is initiated, and when the datanode is done, it calls commitBlockSynchronization() against the namenode. A burst of these calls can slow down the namenode considerably. We need to throttle block recovery and/or speed up the rate at which commitBlockSynchronization() is served.
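
       For context, a minimal sketch of the mechanism being described: a hard-limit sweep that releases every expired lease in one pass. The class and method names loosely mirror the namenode's lease monitor but are illustrative, not the actual HDFS code.

       import java.util.ArrayList;
       import java.util.List;

       // Illustrative sketch only -- not the real LeaseManager. Shows how one
       // hard-limit sweep can kick off block recovery for every abandoned file
       // at once, producing the commitBlockSynchronization() burst.
       class LeaseSweepSketch {
           static final long HARD_LIMIT_MS = 60L * 60 * 1000;   // 1-hour lease hard limit

           static class Lease {
               final String holder;
               final long lastRenewedMs;
               final List<String> paths = new ArrayList<>();

               Lease(String holder, long lastRenewedMs) {
                   this.holder = holder;
                   this.lastRenewedMs = lastRenewedMs;
               }

               boolean expiredHardLimit(long nowMs) {
                   return nowMs - lastRenewedMs > HARD_LIMIT_MS;
               }
           }

           // Every expired lease is released in the same sweep, so abandoning tens
           // of thousands of files queues that many block recoveries almost at once.
           static void checkLeases(List<Lease> leasesSortedByRenewal, long nowMs) {
               for (Lease lease : leasesSortedByRenewal) {
                   if (!lease.expiredHardLimit(nowMs)) {
                       break;   // remaining leases were renewed more recently
                   }
                   for (String path : lease.paths) {
                       internalReleaseLease(path);   // triggers block recovery on a datanode
                   }
               }
           }

           static void internalReleaseLease(String path) {
               System.out.println("scheduling block recovery for last block of " + path);
           }
       }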

        Issue Links

          Activity

          Kihwal Lee added a comment -

          I think the overhead of block recoveries caused by massive lease expiration will be drastically reduced after HDFS-5790.

          Kihwal Lee added a comment -

          So it looks like logSync() is done while holding the write lock during lease release and the scheduling of block recovery. We should probably hold off on this jira and fix that first.

          Filed HDFS-4186.

          Kihwal Lee added a comment -

          > Could we simply jitter the lease expiration time? The 1 hour limit isn't anything precise which a user would depend on. I could see adding a 10% jitter to smooth out the expirations over time.

          If it is a short burst, jitter will scatter it. But if it's a long burst, this won't help much. In some cases, over 40K block recoveries were still outstanding after an hour, which caused the leases held by the namenode during recovery to expire and resulted in yet another storm of block recoveries.
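
          To make the jitter idea concrete, a minimal sketch, assuming a stable per-lease slack derived from the holder name; the method and constant names are hypothetical, not actual HDFS code.

          // Hypothetical sketch of jittering the lease hard limit.
          class LeaseJitterSketch {
              static final long HARD_LIMIT_MS = 60L * 60 * 1000;   // 1 hour

              // Gives each lease a stable extra slack of 0-10% of the hard limit,
              // derived from the holder name, so leases abandoned together expire
              // spread over roughly six minutes instead of in a single sweep.
              static boolean expiredHardLimit(String holder, long lastRenewedMs, long nowMs) {
                  long jitterMs = Math.floorMod(holder.hashCode(), 100) * (HARD_LIMIT_MS / 1000);
                  return nowMs - lastRenewedMs > HARD_LIMIT_MS + jitterMs;
              }
          }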

          > Any idea why it's so slow aside from the logging? Do we have a silly bug where we are logSyncing while holding the lock or something?

          I couldn't find anything in commitBlockSynchronization(), but releasing the lease and initiating block recovery may be the source of the problem.

          internalReleaseLease() : this is called from the lease monitor with the write lock held
          -> reassignLease()
          -> logReassignLease() : reacquires the write lock and logSync() happens.

          So it looks like logSync() is done while holding the write lock during lease release and the scheduling of block recovery.

          We should probably hold off on this jira and fix that first.
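
          To make the pattern concrete, a minimal sketch of the shape described in the trace above and one way to move the sync outside the lock. writeLock()/writeUnlock()/logEdit()/logSync() stand in for the real FSNamesystem and FSEditLog calls, and the restructured version is only an assumption, not the actual HDFS-4186 change.

          class LogSyncSketch {
              // Problematic shape from the trace: logSync() runs while the
              // namesystem write lock is still held, so every other writer waits
              // behind the edit-log disk sync.
              void reassignLeaseSyncUnderLock() {
                  writeLock();
                  try {
                      logEdit();   // queue the lease-reassignment edit record
                      logSync();   // durable sync while still holding the write lock
                  } finally {
                      writeUnlock();
                  }
              }

              // Restructured shape: queue the edit under the lock, sync after
              // release, so other namespace operations can proceed during the sync.
              void reassignLeaseSyncOutsideLock() {
                  writeLock();
                  try {
                      logEdit();
                  } finally {
                      writeUnlock();
                  }
                  logSync();
              }

              // Stand-ins for the real lock and edit-log calls; bodies omitted.
              void writeLock() {}
              void writeUnlock() {}
              void logEdit() {}
              void logSync() {}
          }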

          Aaron T. Myers added a comment -

          I personally prefer adding jitter to the lease expiration times. The trouble with throttling lease recovery is that if we choose a value that's too low then over time outstanding leases will build up in the NN if the rate of abandoned leases exceeds the throttling rate.

          Tsz Wo Nicholas Sze added a comment -

          This problem looks very similar to the large-delete problem: the LeaseManager recovers all expired leases at the same time. We could limit the number of leases processed in each iteration; that should be enough, since the thread sleeps for NAMENODE_LEASE_RECHECK_INTERVAL between iterations.
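
          A rough sketch of the per-iteration cap being suggested. NAMENODE_LEASE_RECHECK_INTERVAL is the constant named above (its value here is illustrative), while MAX_LEASES_PER_ITERATION and the surrounding structure are hypothetical, not the real LeaseManager code.

          import java.util.Deque;

          class LeaseMonitorSketch {
              static final long NAMENODE_LEASE_RECHECK_INTERVAL = 2000; // sweep interval in ms (value illustrative)
              static final int MAX_LEASES_PER_ITERATION = 1000;         // hypothetical cap

              // Releases at most MAX_LEASES_PER_ITERATION leases per sweep, then
              // sleeps, so the resulting block recoveries are spread across sweeps
              // instead of hitting the namenode in one burst.
              void run(Deque<String> expiredLeaseHolders) throws InterruptedException {
                  while (!expiredLeaseHolders.isEmpty()) {
                      int released = 0;
                      while (released < MAX_LEASES_PER_ITERATION && !expiredLeaseHolders.isEmpty()) {
                          releaseLease(expiredLeaseHolders.poll());
                          released++;
                      }
                      Thread.sleep(NAMENODE_LEASE_RECHECK_INTERVAL);
                  }
              }

              void releaseLease(String holder) {
                  System.out.println("releasing expired lease for " + holder);
              }
          }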

          > ... I propose a configurable rate with 300/min as default.

          I think it is too small but I don't have experimental data to support my argument.

          Todd Lipcon added a comment -

          A few thoughts:
          1) Could we simply jitter the lease expiration time? The 1 hour limit isn't anything precise which a user would depend on. I could see adding a 10% jitter to smooth out the expirations over time.
          2) Any idea why it's so slow aside from the logging? Do we have a silly bug where we are logSyncing while holding the lock or something?

          Kihwal Lee added a comment -

          > What is a large number of files? A few hundred?

          Tens of thousands. Since this is only noticed in large clusters, it might be better to disable throttling by default.

          Radim Kolar added a comment -

          What is a large number of files? A few hundred?

          Kihwal Lee added a comment -

          We've seen a burst of commitBlockSynchronization() calls make the namenode unresponsive for a long time, causing other important RPC calls such as lease renewal and heartbeats to fail. Since the blocks are copied, it can also create a lot of cluster-wide traffic.

          The commitBlockSynchronization() method logs two messages, one at the beginning after acquiring the write lock and another after releasing it and syncing the edit log. The time between the two is usually less than 1-2 ms, so the actual processing and sync time don't seem long. But when the namenode gets a burst of these calls, it can only sustain a rate of 20-30 per second, with almost no other requests being served. When these calls are served back-to-back, the gap between calls ranges from 20-100ms.

          The calls are supposed to be blocked at the write lock. Although enabling fairness is known to cause significant performance degradation on a write-heavy ReadWriteLock (about 80% degradation with 100 threads in my experiment), the overhead is still very small compared to the 20-100ms wait times we saw.
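
          As an aside on the fairness comparison, a self-contained sketch of how such a measurement might be run; the thread count, operation count, and trivial critical section are illustrative, not the original experiment.

          import java.util.concurrent.locks.ReentrantReadWriteLock;

          public class LockFairnessSketch {
              // Runs a write-heavy workload against a fair or non-fair lock and
              // returns the elapsed time in milliseconds.
              static long run(boolean fair, int threads, int opsPerThread) throws InterruptedException {
                  ReentrantReadWriteLock lock = new ReentrantReadWriteLock(fair);
                  long[] counter = {0};
                  Thread[] workers = new Thread[threads];
                  long start = System.nanoTime();
                  for (int i = 0; i < threads; i++) {
                      workers[i] = new Thread(() -> {
                          for (int op = 0; op < opsPerThread; op++) {
                              lock.writeLock().lock();
                              try {
                                  counter[0]++;          // trivial critical section
                              } finally {
                                  lock.writeLock().unlock();
                              }
                          }
                      });
                      workers[i].start();
                  }
                  for (Thread t : workers) {
                      t.join();
                  }
                  return (System.nanoTime() - start) / 1_000_000;
              }

              public static void main(String[] args) throws InterruptedException {
                  System.out.println("non-fair: " + run(false, 100, 2_000) + " ms");
                  System.out.println("fair:     " + run(true, 100, 2_000) + " ms");
              }
          }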

          Regardless of the performance and efficiency of commitBlockSynchronization(), I think it is reasonable to throttle block recovery so that the namenode can avoid shooting itself. It would be nice to have feedback-based dynamic scheduling of asynchronous work, but simple throttling may do for now. I propose a configurable rate, with 300/min as the default.
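
          For illustration, a minimal sketch of such a throttle, assuming a simple fixed one-minute window. The class name and the idea of treating a non-positive rate as "throttling disabled" (per the earlier remark about disabling by default) are assumptions; only the 300/min default comes from this proposal.

          // Hypothetical sketch of a fixed-rate throttle for scheduling block
          // recoveries; not an actual HDFS setting or class.
          class BlockRecoveryThrottleSketch {
              private final int maxPerMinute;
              private long windowStartMs;
              private int scheduledInWindow;

              BlockRecoveryThrottleSketch(int maxPerMinute) {   // e.g. 300, per the proposal
                  this.maxPerMinute = maxPerMinute;
                  this.windowStartMs = System.currentTimeMillis();
              }

              // Returns true if another block recovery may be scheduled now; the
              // lease monitor would skip the lease and retry on a later sweep when
              // this returns false.
              synchronized boolean tryAcquire() {
                  long now = System.currentTimeMillis();
                  if (now - windowStartMs >= 60_000) {
                      windowStartMs = now;              // start a new one-minute window
                      scheduledInWindow = 0;
                  }
                  if (maxPerMinute <= 0) {
                      return true;                      // non-positive rate disables throttling
                  }
                  if (scheduledInWindow < maxPerMinute) {
                      scheduledInWindow++;
                      return true;
                  }
                  return false;
              }
          }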


            People

            • Assignee: Kihwal Lee
            • Reporter: Kihwal Lee
            • Votes: 0
            • Watchers: 11

              Dates

              • Created:
                Updated:
                Resolved:

                Development