Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 2.3.0
    • Component/s: caching, namenode
    • Labels: None
    • Target Version/s:

      Description

      This was reported by Chris Nauroth and Brandon Li, and Stephen Chu repro'd it too.

      If you add a new caching directive then remove it, the Namenode will sometimes get stuck in a loop where it sends DNA_CACHE and then DNA_UNCACHE repeatedly to the datanodes where the data was previously cached.

      1. hdfs-5589-2.patch
        19 kB
        Andrew Wang
      2. hdfs-5589-1.patch
        10 kB
        Andrew Wang

        Issue Links

          Activity

          Andrew Wang created issue -
          Andrew Wang made changes -
          Field Original Value New Value
          Assignee Colin Patrick McCabe [ cmccabe ]
          Andrew Wang added a comment -

          Perhaps related, but there's an off-by-at-least-one error here when processing block reports. I have a 1 node cluster and added a cache directive with a repl of 3. Saw this log message:

          13/12/04 17:51:39 WARN blockmanagement.CacheReplicationMonitor: We need 1 more replica(s) than actually exist to provide a cache replication of 3 for {blockId=1073741825, replication=3, mark=false}
          

          When I bumped it to 4, it said 2, and at 2 it said 0. My guess is that the pending queue isn't getting cleared properly, leading to the single node getting double counted.
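A minimal sketch of the double-counting hypothesis above, not the actual CacheReplicationMonitor code. The class and method names are hypothetical. It assumes the shortfall is computed by adding the sizes of a cached list and a pending list, so a single datanode that appears in both is counted twice.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the replica shortfall computation; names are illustrative only. */
class ReplicaCountSketch {

  /** Sums both lists, so a datanode present in both is counted twice. */
  static int neededNaive(int replication, List<String> cached, List<String> pendingCached) {
    return Math.max(0, replication - (cached.size() + pendingCached.size()));
  }

  /** Counts each datanode once, whether it is cached, pending, or both. */
  static int neededDeduplicated(int replication, List<String> cached, List<String> pendingCached) {
    Set<String> distinct = new HashSet<>(cached);
    distinct.addAll(pendingCached);
    return Math.max(0, replication - distinct.size());
  }

  public static void main(String[] args) {
    // One-node cluster: the same datanode is still in the pending list after it cached the block.
    List<String> cached = Arrays.asList("dn1");
    List<String> pending = Arrays.asList("dn1");
    for (int repl : new int[] {2, 3, 4}) {
      System.out.println("repl=" + repl
          + " naive=" + neededNaive(repl, cached, pending)
          + " deduplicated=" + neededDeduplicated(repl, cached, pending));
    }
  }
}
```

With one datanode in both lists, the naive count gives 0, 1, and 2 for replication 2, 3, and 4, which matches the numbers in the log messages; counting the node once gives 1, 2, and 3.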

          Andrew Wang added a comment -

          Patch attached. The cache/uncache bug was that we weren't clearing out blocks that weren't marked during the directive scan. So, an orphan block would retain the old mark and replication factor, and become cached again on the next rescan when the mark flipped to the old value.

          I also incorporated HDFS-5507 (considering stale and capacity when caching), and also fixed another bug I found where we'd try to cache a block again on a node that already had it cached.
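The cache/uncache loop described above can be modeled roughly as follows. This is a simplified sketch, not the patch itself: the class, the `clearOrphans` flag, and the string commands are all hypothetical. It assumes a rescan flips a boolean mark, stamps every block covered by a directive with the current mark, and treats any block whose mark matches as wanted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical model of a flipping-mark rescan; not the real CacheReplicationMonitor. */
class MarkFlipSketch {
  static final class Block {
    boolean mark;     // scan mark at the time a directive last covered this block
    int replication;  // cache replication requested at that time
  }

  private final Map<Long, Block> blocks = new TreeMap<>();
  private boolean mark = false;
  private final boolean clearOrphans;  // true models the fix, false models the bug

  MarkFlipSketch(boolean clearOrphans) {
    this.clearOrphans = clearOrphans;
  }

  /** Runs one rescan and returns the command issued for each tracked block. */
  List<String> rescan(Set<Long> directiveBlockIds, int replication) {
    mark = !mark;
    for (Long id : directiveBlockIds) {
      Block b = blocks.computeIfAbsent(id, k -> new Block());
      b.mark = mark;
      b.replication = replication;
    }
    List<String> commands = new ArrayList<>();
    Iterator<Map.Entry<Long, Block>> it = blocks.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<Long, Block> e = it.next();
      if (e.getValue().mark == mark) {
        commands.add("CACHE " + e.getKey());
      } else {
        commands.add("UNCACHE " + e.getKey());
        if (clearOrphans) {
          it.remove();  // forget the orphan so its stale mark cannot match a later scan
        }
      }
    }
    return commands;
  }

  public static void main(String[] args) {
    for (boolean fix : new boolean[] {false, true}) {
      MarkFlipSketch monitor = new MarkFlipSketch(fix);
      System.out.println("clearOrphans=" + fix);
      System.out.println(monitor.rescan(Collections.singleton(1L), 1));
      for (int i = 0; i < 3; i++) {
        // The directive has been removed, so no block is covered any more.
        System.out.println(monitor.rescan(Collections.<Long>emptySet(), 1));
      }
    }
  }
}
```

Without the cleanup, the orphan block keeps its old mark, so every second rescan sees a matching mark and issues a cache command again. Dropping the orphan on the first unmarked scan ends the alternation.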

          Andrew Wang made changes -
          Attachment hdfs-5589-1.patch [ 12621222 ]
          Andrew Wang made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Assignee Andrew Wang [ andrew.wang ]
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12621222/hdfs-5589-1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5816//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5816//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          Good work-- it's great to have finally found that bug.

          +      DatanodeDescriptor datanode = chooseDatanodeForCaching(possibilities,
          +          blockManager.getDatanodeManager().getStaleInterval());
          +      assert (datanode != null);
          

          Don't we need to handle the case where datanode == null here? It seems like that will come up if, for example, we have no space available on any datanode. It might be better to move the code from the if (possibilities.isEmpty()) statement into a new check for chooseDatanodeForCaching == null. It would be nice to see a test for this case too-- just fill up the DN caches, or set them to 0...
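One way to read the suggestion above is to let the chooser return null when nothing is eligible and have the caller stop instead of asserting. The sketch below illustrates that shape only. The `Node` type, the method names, and the "most remaining cache" selection policy are assumptions made for illustration, not the patch's actual logic or signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of null-safe datanode selection; names and policy are illustrative. */
class NullSafeChooserSketch {
  static final class Node {
    final String name;
    final long cacheRemaining;  // bytes of cache still free on this node
    final boolean stale;

    Node(String name, long cacheRemaining, boolean stale) {
      this.name = name;
      this.cacheRemaining = cacheRemaining;
      this.stale = stale;
    }
  }

  /** Returns the eligible node with the most free cache, or null if none can take the block. */
  static Node chooseNodeForCaching(List<Node> possibilities, long blockSize) {
    Node best = null;
    for (Node n : possibilities) {
      if (n.stale || n.cacheRemaining < blockSize) {
        continue;  // skip stale nodes and nodes without enough free cache
      }
      if (best == null || n.cacheRemaining > best.cacheRemaining) {
        best = n;
      }
    }
    return best;
  }

  /** Picks up to neededReplicas nodes and stops early, rather than asserting, when none is left. */
  static List<String> scheduleCaching(List<Node> possibilities, long blockSize, int neededReplicas) {
    List<Node> remaining = new ArrayList<>(possibilities);
    List<String> chosen = new ArrayList<>();
    for (int i = 0; i < neededReplicas; i++) {
      Node n = chooseNodeForCaching(remaining, blockSize);
      if (n == null) {
        break;  // nothing eligible: leave the block under-cached for a later rescan
      }
      chosen.add(n.name);
      remaining.remove(n);
    }
    return chosen;
  }

  public static void main(String[] args) {
    List<Node> nodes = Arrays.asList(
        new Node("dn1", 100, false), new Node("dn2", 10, false), new Node("dn3", 500, true));
    System.out.println(scheduleCaching(nodes, 50, 3));
  }
}
```

Handling null in the caller covers the empty-candidate case and the no-capacity case with a single check.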

          Andrew Wang added a comment -

          When we're populating possibilities, we check the DNs for validity, including having enough remaining capacity, so I think this is technically right. I agree though that it reads poorly, so I'll refactor this, and also add a test that tries to cache some big files.

          Colin Patrick McCabe made changes -
          Link This issue is duplicated by HDFS-5507 [ HDFS-5507 ]
          Andrew Wang added a comment -

          New patch rev. I refactored the CacheReplicationMonitor (CRM) selection logic somewhat, and now also account for pending cache commands when computing a DN's effective remaining capacity. I also included a new test for caching large files that exceed DN capacity.
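A rough sketch of what accounting for pending cache commands could look like. This is an illustration under stated assumptions, not the patch's code: the class and method names are hypothetical, and it assumes effective capacity is the last reported free cache minus the sizes of blocks already scheduled for caching on that datanode.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of effective cache capacity; names are illustrative only. */
class EffectiveCapacitySketch {

  /** Free cache as last reported, minus bytes already promised to pending cache commands. */
  static long effectiveCacheRemaining(long reportedRemaining, List<Long> pendingBlockSizes) {
    long pendingBytes = 0;
    for (long size : pendingBlockSizes) {
      pendingBytes += size;
    }
    return Math.max(0L, reportedRemaining - pendingBytes);
  }

  /** A block fits only if it is no larger than the effective remaining capacity. */
  static boolean canCache(long blockSize, long reportedRemaining, List<Long> pendingBlockSizes) {
    return blockSize <= effectiveCacheRemaining(reportedRemaining, pendingBlockSizes);
  }

  public static void main(String[] args) {
    List<Long> pending = Arrays.asList(40L, 40L);
    System.out.println(effectiveCacheRemaining(100L, pending));
    System.out.println(canCache(30L, 100L, pending));
  }
}
```

Without the subtraction, several blocks could be scheduled onto the same datanode between heartbeats even though it only has room for one of them.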

          Andrew Wang made changes -
          Attachment hdfs-5589-2.patch [ 12621425 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12621425/hdfs-5589-2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5822//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5822//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          +1. Thanks, Andrew.

          Test failure is unrelated and did not reproduce for me... TestHASafeMode seems to be flaky lately.

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #4961 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4961/)
          HDFS-5589. Namenode loops caching and uncaching when data should be uncached. (awang via cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1555996)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/CacheReplicationMonitor.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestFsDatasetCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCacheDirectives.java
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #445 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/445/)
          HDFS-5589. Namenode loops caching and uncaching when data should be uncached. (awang via cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1555996)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/CacheReplicationMonitor.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestFsDatasetCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCacheDirectives.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1637 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1637/)
          HDFS-5589. Namenode loops caching and uncaching when data should be uncached. (awang via cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1555996)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/CacheReplicationMonitor.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestFsDatasetCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCacheDirectives.java
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1662 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1662/)
          HDFS-5589. Namenode loops caching and uncaching when data should be uncached. (awang via cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1555996)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/CacheReplicationMonitor.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestFsDatasetCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCacheDirectives.java
          Andrew Wang added a comment -

          Resolving as this was committed to trunk.

          Andrew Wang made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 3.0.0 [ 12320356 ]
          Resolution Fixed [ 1 ]
          Allen Wittenauer made changes -
          Fix Version/s 2.3.0 [ 12325255 ]
          Fix Version/s 3.0.0 [ 12320356 ]
          Vinod Kumar Vavilapalli added a comment -

          Closing tickets that are already part of a release.

          Vinod Kumar Vavilapalli made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Transition                   Time In Source Status  Execution Times  Last Executer            Last Execution Date
          Open -> Patch Available      31d 7h 10m             1                Andrew Wang              03/Jan/14 02:39
          Patch Available -> Resolved  6d 22h 38m             1                Andrew Wang              10/Jan/14 01:17
          Resolved -> Closed           536d 6h 7m             1                Vinod Kumar Vavilapalli  30/Jun/15 08:25

            People

            • Assignee:
              Andrew Wang
              Reporter:
              Andrew Wang
            • Votes:
              0
              Watchers:
              8

              Dates

              • Created:
                Updated:
                Resolved:
