Hadoop Common
  1. Hadoop Common
  2. HADOOP-8292

TableMapping does not refresh when topology is updated

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      HADOOP-7030 introduced TableMapping, an implementation of DNSToSwitchMapping which uses a file to map from IPs/hosts to their racks. It's intended to replace ScriptBasedMapping for cases where the latter was just a complicated way of looking up the rack in a file.

      Though there was discussion of it on the JIRA, the TableMapping implementation is not 'refreshable'. i.e., if you want to add a host to your cluster, and that host wasn't in the topology file to begin with, it will never be added.

      TableMapping should refresh, either based on a command that can be executed, or, perhaps, if the file on disk changes.

      I'll also point out that TableMapping extends CachedDNSToSwitchMapping, but, since it does no refreshing, I don't see what the caching gets you: I think the cache ends up being a second copy of the underlying map, always.

      1. HADOOP-8292.patch
        10 kB
        Alejandro Abdelnur
      2. HADOOP-8292.patch
        8 kB
        Alejandro Abdelnur

        Activity

        Hide
        Alejandro Abdelnur added a comment -

        adding a lazy reload check on resolve(), the reload check will check the mapping file for changes using a configurable minimum interval (default value set to 10 secs).

        Show
        Alejandro Abdelnur added a comment - adding a lazy reload check on resolve(), the reload check will check the mapping file for changes using a configurable minimum interval (default value set to 10 secs).
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12526085/HADOOP-8292.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common:

        org.apache.hadoop.fs.viewfs.TestViewFsTrash

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/963//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/963//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526085/HADOOP-8292.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common: org.apache.hadoop.fs.viewfs.TestViewFsTrash +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/963//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/963//console This message is automatically generated.
        Hide
        Todd Lipcon added a comment -

        Few quick notes (didn't look in serious detail yet):

        • is it OK to be doing file system access while holding this lock, and in the "hot path" of resolve? I worry that this might slow down client requests, for example.
        • I think we should avoid reading the file if the modification time is within the last couple of seconds – with some editors and config management systems, updating a file might temporarily leave it in an empty state before re-filling it again with the new data. Well behaved systems won't do that, but I think it's better for us to be resilient to it than for us to end up loading an empty topology mapping.
        • clearCache() is called from resolve() without the lock held. That might cause multiple threads to call clear() on a map at once, which might result in an exception or something.

        How does this interact with the HDFS topology code which needs to check when a cluster changes from single-rack to multi-rack? When a node's topology changes, don't we need to re-check replication policies for all the blocks, etc? Maybe this isn't a new issue, but it's certainly strange.

        Show
        Todd Lipcon added a comment - Few quick notes (didn't look in serious detail yet): is it OK to be doing file system access while holding this lock, and in the "hot path" of resolve? I worry that this might slow down client requests, for example. I think we should avoid reading the file if the modification time is within the last couple of seconds – with some editors and config management systems, updating a file might temporarily leave it in an empty state before re-filling it again with the new data. Well behaved systems won't do that, but I think it's better for us to be resilient to it than for us to end up loading an empty topology mapping. clearCache() is called from resolve() without the lock held. That might cause multiple threads to call clear() on a map at once, which might result in an exception or something. How does this interact with the HDFS topology code which needs to check when a cluster changes from single-rack to multi-rack? When a node's topology changes, don't we need to re-check replication policies for all the blocks, etc? Maybe this isn't a new issue, but it's certainly strange.
        Hide
        Alejandro Abdelnur added a comment -

        new patch addressing Todd's comments.

        the new patch removes the 'hot path' by using a daemon thread that does the check/reload and swapping the Map using an AtomicRef. It does a 2 seconds lag check as you suggested. Regarding the clearCache(), the underlaying impl is a ConcurrentMap, so that would not be an issue (still, now the clearcache is called after swapping the new Map in the daemon thread).

        Regarding your last question on how it may impact HDFS, I don't know I'll dig.

        Show
        Alejandro Abdelnur added a comment - new patch addressing Todd's comments. the new patch removes the 'hot path' by using a daemon thread that does the check/reload and swapping the Map using an AtomicRef. It does a 2 seconds lag check as you suggested. Regarding the clearCache(), the underlaying impl is a ConcurrentMap, so that would not be an issue (still, now the clearcache is called after swapping the new Map in the daemon thread). Regarding your last question on how it may impact HDFS, I don't know I'll dig.
        Hide
        Alejandro Abdelnur added a comment -

        Forgot to mention, I'm using a daemon thread with a while(true) loop because the DNSToSwitchMapping interface does not have lifecycle methods (init/destroy) I could use to shutdown the thread. This could be done, but I'd argue as part of another JIRA

        Show
        Alejandro Abdelnur added a comment - Forgot to mention, I'm using a daemon thread with a while(true) loop because the DNSToSwitchMapping interface does not have lifecycle methods (init/destroy) I could use to shutdown the thread. This could be done, but I'd argue as part of another JIRA
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12526177/HADOOP-8292.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common:

        org.apache.hadoop.fs.viewfs.TestViewFsTrash

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/967//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/967//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526177/HADOOP-8292.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common: org.apache.hadoop.fs.viewfs.TestViewFsTrash +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/967//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/967//console This message is automatically generated.
        Hide
        Harsh J added a comment -
        +  protected void clearCache() {
        +    cache.clear();
        +  }
        

        Is there a way to also "invoke" a cache clearance dynamically (say via a dfsadmin command)? Should we have one?

        Patch appears alright to me. Mind checking it once again and rebasing?

        Todd's question:

        How does this interact with the HDFS topology code which needs to check when a cluster changes from single-rack to multi-rack? When a node's topology changes, don't we need to re-check replication policies for all the blocks, etc? Maybe this isn't a new issue, but it's certainly strange.

        Currently the fsck begins warning as soon as the mapping changes in its perspective (when it checks for it). Things such as balancer, etc. go crazy after that but doesn't cause a downtime as such. Perhaps this needs to be discussed in a separate JIRA?

        Show
        Harsh J added a comment - + protected void clearCache() { + cache.clear(); + } Is there a way to also "invoke" a cache clearance dynamically (say via a dfsadmin command)? Should we have one? Patch appears alright to me. Mind checking it once again and rebasing? Todd's question: How does this interact with the HDFS topology code which needs to check when a cluster changes from single-rack to multi-rack? When a node's topology changes, don't we need to re-check replication policies for all the blocks, etc? Maybe this isn't a new issue, but it's certainly strange. Currently the fsck begins warning as soon as the mapping changes in its perspective (when it checks for it). Things such as balancer, etc. go crazy after that but doesn't cause a downtime as such. Perhaps this needs to be discussed in a separate JIRA?
        Hide
        Alejandro Abdelnur added a comment -

        canceling patch for now as it may need more discussion.

        Show
        Alejandro Abdelnur added a comment - canceling patch for now as it may need more discussion.

          People

          • Assignee:
            Alejandro Abdelnur
            Reporter:
            Philip Zeyliger
          • Votes:
            0 Vote for this issue
            Watchers:
            12 Start watching this issue

            Dates

            • Created:
              Updated:

              Development