Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-6784

Avoid rescan twice in HDFS CacheReplicationMonitor for one FS Op if it calls setNeedsRescan multiple times.

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: caching
    • Labels:
      None
    • Target Version/s:

      Description

      In HDFS CacheReplicationMonitor, rescan is expensive. Sometimes, setNeedsRescan is called multiple times, for example, in FSNamesystem#modifyCacheDirective, there are 3 times. In monitor thread of CacheReplicationMonitor, if it checks needsRescan is true, rescan will happen, but needsRescan is set to false before real scan. Meanwhile, the 2nd or 3rd time setNeedsResacn may set needsRescan to true. So after the scan finish, in next loop, a new rescan will be triggered, that's not necessary at all and inefficient for rescan twice.

        Activity

        Hide
        Yi Liu added a comment -

        This issue is resolved together in HDFS-6783, so mark it as duplicated.

        Show
        Yi Liu added a comment - This issue is resolved together in HDFS-6783 , so mark it as duplicated.
        Hide
        Yi Liu added a comment -

        Thanks Colin Patrick McCabe, let's wait to see whether we need to handle it separately after HDFS-6783.

        Show
        Yi Liu added a comment - Thanks Colin Patrick McCabe , let's wait to see whether we need to handle it separately after HDFS-6783 .
        Hide
        Colin Patrick McCabe added a comment -

        I thought about this a little more, and I posted a patch on HDFS-6783 that I think solves this problem. Check it out.

        Sorry for all the confusion... sometimes it's tough to reason about these locking issues.

        Show
        Colin Patrick McCabe added a comment - I thought about this a little more, and I posted a patch on HDFS-6783 that I think solves this problem. Check it out. Sorry for all the confusion... sometimes it's tough to reason about these locking issues.
        Hide
        Colin Patrick McCabe added a comment -

        If modifyDirective is caling setNeedsRescan multiple times, each time could trigger a rescan that completes before the next time.

        Sorry, my statement above wasn't quite correct. Since the rescan needs the FSN lock, and the modifyDirective is holding it, the rescan can't complete until the modifyDirective is done.

        Earlier versions of the CRM#rescan code did release the FSN lock intermittently to allow for greater concurrency. We changed it to hold the FSN lock for the full duration for simplicity. But we want to start releasing the lock inside that function in the future, as I've commented earlier. So that's why I don't like this patch... it relies on the FSN lock being held for the whole duration of the rescan.

        Show
        Colin Patrick McCabe added a comment - If modifyDirective is caling setNeedsRescan multiple times, each time could trigger a rescan that completes before the next time. Sorry, my statement above wasn't quite correct. Since the rescan needs the FSN lock, and the modifyDirective is holding it, the rescan can't complete until the modifyDirective is done. Earlier versions of the CRM#rescan code did release the FSN lock intermittently to allow for greater concurrency. We changed it to hold the FSN lock for the full duration for simplicity. But we want to start releasing the lock inside that function in the future, as I've commented earlier. So that's why I don't like this patch... it relies on the FSN lock being held for the whole duration of the rescan.
        Hide
        Colin Patrick McCabe added a comment -

        Ah. You're right, both cases are going FSN -> CRM, not the other way around. That's what I get for commenting so late at night.

        Anyway, this is not the right approach. If modifyDirective is caling setNeedsRescan multiple times, each time could trigger a rescan that completes before the next time. There's no reason to assume that the thread will run after all the calls have been made-- in fact, given the way condition variables work, it's more likely that it will run immediately. In that case, this patch is not useful, even leaving aside any other issues with it. You need to remove the duplicate calls to setNeedsRescan and call it only once.

        Show
        Colin Patrick McCabe added a comment - Ah. You're right, both cases are going FSN -> CRM, not the other way around. That's what I get for commenting so late at night. Anyway, this is not the right approach. If modifyDirective is caling setNeedsRescan multiple times, each time could trigger a rescan that completes before the next time. There's no reason to assume that the thread will run after all the calls have been made-- in fact, given the way condition variables work, it's more likely that it will run immediately. In that case, this patch is not useful, even leaving aside any other issues with it. You need to remove the duplicate calls to setNeedsRescan and call it only once.
        Hide
        Yi Liu added a comment -

        Hi Colin

        This patch adds a place where we hold the CRM lock and try to acquire the FSN lock, which could lead to deadlock.

        It's false, the patch is we hold the FSN lock (rescan method will do this, right?), and then try to get CRM lock to set needsRescan, so it will not lead to deadlock. (Actually I have considered this)

        Show
        Yi Liu added a comment - Hi Colin This patch adds a place where we hold the CRM lock and try to acquire the FSN lock, which could lead to deadlock. It's false, the patch is we hold the FSN lock (rescan method will do this, right?), and then try to get CRM lock to set needsRescan , so it will not lead to deadlock. (Actually I have considered this)
        Hide
        Colin Patrick McCabe added a comment -

        Actually, I just looked at this again, and I see a bigger problem. We have methods in CacheManager holding the FSN lock and acquiring the CRM lock. This patch adds a place where we hold the CRM lock and try to acquire the FSN lock, which could lead to deadlock.

        Like I said earlier, I have a hunch that the right way to go is to avoid unnecessary calls to setNeedsRescan...

        Show
        Colin Patrick McCabe added a comment - Actually, I just looked at this again, and I see a bigger problem. We have methods in CacheManager holding the FSN lock and acquiring the CRM lock. This patch adds a place where we hold the CRM lock and try to acquire the FSN lock, which could lead to deadlock. Like I said earlier, I have a hunch that the right way to go is to avoid unnecessary calls to setNeedsRescan...
        Hide
        Yi Liu added a comment -

        Thanks Colin Patrick McCabe

        If a needsRescan arrives while a scan is currently in progress

        But I have checked in current implementation, all places which call setNeedsRescan are protected by FSN lock, so if a scan is currently in progress, there is no needsRescan arrives.

        I think the solution is to minimize the number of times we call setNeedsRescan from these functions.

        Agree this is one approach, I thought if we could resolve this from the root it's better.

        Show
        Yi Liu added a comment - Thanks Colin Patrick McCabe If a needsRescan arrives while a scan is currently in progress But I have checked in current implementation, all places which call setNeedsRescan are protected by FSN lock, so if a scan is currently in progress, there is no needsRescan arrives. I think the solution is to minimize the number of times we call setNeedsRescan from these functions. Agree this is one approach, I thought if we could resolve this from the root it's better.
        Hide
        Colin Patrick McCabe added a comment -

        This isn't quite right. If a needsRescan arrives while a scan is currently in progress, we do need to trigger another scan, since there may be updates that the current scan doesn't reflect.

        for example, in FSNamesystem#modifyCacheDirective, there are 3 times [that setNeedsRescan is called]

        I think the solution is to minimize the number of times we call setNeedsRescan from these functions.

        Show
        Colin Patrick McCabe added a comment - This isn't quite right. If a needsRescan arrives while a scan is currently in progress, we do need to trigger another scan, since there may be updates that the current scan doesn't reflect. for example, in FSNamesystem#modifyCacheDirective, there are 3 times [that setNeedsRescan is called] I think the solution is to minimize the number of times we call setNeedsRescan from these functions.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12658640/HDFS-6784.001.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7498//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7498//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658640/HDFS-6784.001.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7498//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7498//console This message is automatically generated.
        Hide
        Yi Liu added a comment -

        We should set the needsRescan when real rescan starts and protected by FSNamesystem lock.

        Show
        Yi Liu added a comment - We should set the needsRescan when real rescan starts and protected by FSNamesystem lock.

          People

          • Assignee:
            Yi Liu
            Reporter:
            Yi Liu
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development