Hadoop HDFS / HDFS-6786

Fix potential issue of cache refresh interval

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: caching
    • Labels:
      None
    • Target Version/s:

      Description

      In CacheReplicationMonitor, the following code checks whether a rescan is needed once every intervalMs milliseconds; if a rescan takes n ms, those n ms are subtracted from the interval. But if the delta is <= 0, the loop breaks and a rescan starts, which creates a potential issue: if the user sets the interval to a small value, or a rescan finishes only after exceeding the interval, rescans will run in a loop. Furthermore, delta <= 0 triggering a rescan should not be the intention, since needsRescan is set whenever a rescan is actually needed.

               while (true) {
                 if (shutdown) {
                   LOG.info("Shutting down CacheReplicationMonitor");
                   return;
                 }
                 if (needsRescan) {
                   // "Fast track": a pending operation explicitly asked for a rescan.
                   LOG.info("Rescanning because of pending operations");
                   break;
                 }
                 // Wait out whatever is left of the configured interval; time
                 // already elapsed (e.g. in a previous rescan) is subtracted here.
                 long delta = (startTimeMs + intervalMs) - curTimeMs;
                 if (delta <= 0) {
                   LOG.info("Rescanning after " + (curTimeMs - startTimeMs) +
                       " milliseconds");
                   break;
                 }
                 doRescan.await(delta, TimeUnit.MILLISECONDS);
                 curTimeMs = Time.monotonicNow();
               }
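
      To make the concern concrete, here is a hedged walk-through with made-up numbers (the interval and rescan duration below are illustrative, not taken from a real cluster): if a rescan takes longer than intervalMs, delta is already non-positive when the wait loop is re-entered, so the next rescan starts immediately.

               // Hypothetical numbers only; nothing here is measured.
               long intervalMs = 1000L;    // user-configured interval (deliberately small)
               long startTimeMs = 0L;      // reset when the previous rescan started
               long rescanTookMs = 5000L;  // assumed duration of that rescan
               long curTimeMs = startTimeMs + rescanTookMs;
               long delta = (startTimeMs + intervalMs) - curTimeMs; // 1000 - 5000 = -4000
               // delta <= 0, so the loop breaks at once and rescans run back to back.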
      

        Activity

        Colin Patrick McCabe added a comment -

        I hope I'm not misunderstanding this, but I don't see any bug here. If you set a refresh interval so low that rescans happen continuously, that's the intended behavior? And a good reason not to do that? Not all scans are triggered by needsRescan... that's just the "fast track"

        Yi Liu added a comment -

        Thanks Colin Patrick McCabe. I realized I had a misunderstanding about the interval, and I later confirmed this with you offline.

        I still have the concern from the first part of the description: if there is a lot of HDFS cache, a rescan will be expensive, right? (There is too much work in a rescan when cachedBlocks is very large.) The default interval is 30s, and the rescan time, plus the time of any operations if other threads jump in, is subtracted from it. Worse still, if the user sets the interval to an even smaller value, won't the thread fall into a rescan loop? A rescan holds the FSN lock, so won't that affect NN performance?

        Colin Patrick McCabe added a comment -

        I'm not aware of any real clusters where rescan time is an issue right now. Just to give some rough numbers, if you have 100 gigs used for HDFS cache, and blocks of size 128 MB, you'd have 800 cached blocks per DN at present. That's pretty manageable. Even with 100 nodes, you still only have 80,000 cached blocks to look at... certainly not something that would take anywhere near 30 seconds (unless a major GC hits).
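
        For reference, the arithmetic behind those rough numbers (these are the comment's hypothetical inputs, not measurements from any cluster):

            // Back-of-the-envelope math for the figures above.
            long cacheBytesPerDn = 100L * 1024 * 1024 * 1024;        // 100 GiB per DN
            long blockSizeBytes  = 128L * 1024 * 1024;               // 128 MiB blocks
            long blocksPerDn     = cacheBytesPerDn / blockSizeBytes; // = 800
            long totalBlocks     = blocksPerDn * 100;                // 100 DNs -> 80,000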

        There are two directions we could go in optimizing this. One is to avoid holding the lock during the entire cache rescan. This is something we talked about, but didn't quite get around to implementing since it makes some things a lot trickier. We'd have to have some way of handling the list of cached blocks changing in between times we grabbed the lock.
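
        A minimal sketch of that first direction, assuming a chunked scan (the class, method, and chunk size are invented for illustration; nothing like this was implemented in this issue):

            import java.util.List;
            import java.util.concurrent.locks.ReentrantLock;

            // Illustrative only: process cached blocks in chunks, dropping the
            // lock between chunks so other namesystem operations can proceed.
            class ChunkedRescan {
              private static final int CHUNK = 1000;

              static void rescanInChunks(List<Long> cachedBlockIds,
                  ReentrantLock fsnLock) {
                for (int i = 0; i < cachedBlockIds.size(); i += CHUNK) {
                  fsnLock.lock();
                  try {
                    int end = Math.min(i + CHUNK, cachedBlockIds.size());
                    for (long blockId : cachedBlockIds.subList(i, end)) {
                      // re-evaluate caching decisions for blockId here
                    }
                  } finally {
                    fsnLock.unlock(); // other operations can run between chunks
                  }
                  // The block list may change while the lock is dropped; handling
                  // that is exactly the tricky part described above.
                }
              }
            }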

        Another is to be more "incremental." Operations that renamed, deleted, or added files could check with the cache manager to see if their actions modified the cache values, and update the cache then. This seems easy, but is actually very difficult. We'd have to hook into almost every FSNamesystem operation. We'd also have to figure out how to make decisions "incrementally" which is not easy. I think even with an incremental system, we'd want to keep the rescan around as a backstop against any bugs in the incremental accounting. Maybe it would run much less frequently, like every 15 minutes or so.

        Yi Liu added a comment -

        How about setting the interval to a larger value, 5 minutes or the 15 minutes you mentioned? And we could define a minimum interval (e.g. 30s) to protect against a user misconfiguring it and hurting the NN's performance without any warning.
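
        A hedged sketch of what such a floor could look like (nothing was committed for this issue; the constant and method names are invented):

            // Illustrative only: clamp a configured refresh interval to a
            // minimum and warn, instead of silently accepting a tiny value.
            class RescanIntervalGuard {
              private static final long MIN_INTERVAL_MS = 30 * 1000L; // proposed 30s floor

              static long clampInterval(long configuredMs) {
                if (configuredMs < MIN_INTERVAL_MS) {
                  System.err.println("Cache refresh interval " + configuredMs +
                      " ms is below the " + MIN_INTERVAL_MS + " ms minimum; clamping.");
                  return MIN_INTERVAL_MS;
                }
                return configuredMs;
              }
            }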

        Andrew Wang added a comment -

        The downside to increasing the rescan interval is potentially waiting minutes to cache or recache things. As Colin already noted, rescans should be pretty cheap even up to 10TB of cache, which is a pretty big deployment.

        I think the bigger issue is the stop-the-world nature of rescans. We trigger rescans when the list of cache directives changes, so if someone is actually running a giant cluster where rescan time is significant, the # of rescans triggered by the interval will probably be dominated by the # triggered by a cache directive change.

        I'm also not sure about the utility of a minimum here. Considering that we think rescan time will be reasonable for most clusters, most people will never need to touch this config parameter at all. If we did define such a minimum, we'd probably also need a backdoor for the unit tests, since it's useful to set a low rescan interval there.

        Colin Patrick McCabe added a comment -

        I agree with Andrew that a minimum interval seems unnecessary here. Sysadmins rarely adjust this value, and if they do, we assume that they know what they're doing... similar to a lot of the other tunables.

        The only case I can recall where we set a minimum is in block size. But we did that because block size can be controlled by the client creating a file (you don't need to be the sysadmin adjusting a configuration to set the block size).

        I'm going to close this one out since it's working as intended.

        Yi Liu added a comment -

        Thanks Andrew Wang and Colin Patrick McCabe, let's keep it as it was.


          People

          • Assignee:
            Yi Liu
            Reporter:
            Yi Liu
          • Votes:
            0
            Watchers:
            4

            Dates

            • Created:
              Updated:
              Resolved:

              Development