HBase
  1. HBase
  2. HBASE-5163

TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.92.0, 0.94.0
    • Component/s: test
    • Labels:
      None
    • Environment:

      all

    • Hadoop Flags:
      Reviewed

      Description

      The stack is typically:

          <error message="Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked." type="java.io.IOException">java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
      	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
      	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
      	at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.&lt;init&gt;(DataNode.java:290)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
      	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
      	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
      	at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
              // ...
      

      It can be reproduced without parallelization or without executing the other tests in the class. It seems to fail about 5% of the time.

      This comes from the naming policy for the directories in MiniDFSCluster#startDataNode. It depends on the number of nodes currently in the cluster, and does not take into account previous starts/stops:

         for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
            if (manageDfsDirs) {
              File dir1 = new File(data_dir, "data"+(2*i+1));
              File dir2 = new File(data_dir, "data"+(2*i+2));
              dir1.mkdirs();
              dir2.mkdirs();
            // [...]
      

      This means that it if we want to stop/start a datanode, we should always stop the last one, if not the names will conflict. This test exhibits the behavior:

        @Test
        public void testMiniDFSCluster_startDataNode() throws Exception {
      
          assertTrue( dfsCluster.getDataNodes().size() == 2 );
      
      
          // Works, as we kill the last datanode, we can now start a datanode
          dfsCluster.stopDataNode(1);
          dfsCluster
            .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
      
      
          // Fails, as it's not the last datanode, the directory will conflict on
          //  creation
          dfsCluster.stopDataNode(0);
          try {
            dfsCluster
              .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
            fail("There should be an exception because the directory already exists");
          } catch (IOException e) {
            assertTrue( e.getMessage().contains("The directory is already locked."));
            LOG.info("Expected (!) exception caught " + e.getMessage());
          }
      
          // Works, as we kill the last datanode, we can now restart 2 datanodes
          // This makes us back with 2 nodes
          dfsCluster.stopDataNode(0);
          dfsCluster
            .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
        }
      

      And then this behavior is randomly triggered in testLogRollOnDatanodeDeath because when we do

      DatanodeInfo[] pipeline = getPipeline(log);
      assertTrue(pipeline.length == fs.getDefaultReplication());
      

      and then kill the datanodes in the pipeline, we will have:

      • most of the time: pipeline = 1 & 2, so after killing 1&2 we can start a new datanode that will reuse the available 2's directory.
      • sometimes: pipeline = 1 & 3. In this case,when we try to launch the new datanode, it fails because it wants to use the same directory as the still alive '2'.

      There are two ways of fixing the test:
      1) Fix the naming rule in MiniDFSCluster#startDataNode, for example to ensure that the directory names will not be reused. But I wonder if there is not a testCase somewhere (may be not in HBase) depending on this behavior.
      2) Kill explicitly the first and second datanode without using the pipeline to be sure that the names won't conflict.

      Feedback welcome and the choice to make here...

      1. 5163.patch
        7 kB
        Nicolas Liochon
      2. 5163-92.txt
        5 kB
        Ted Yu

        Activity

        Hide
        Ted Yu added a comment -

        Nice analysis.
        I like #2 better. We should always make test scenario deterministic.

        Show
        Ted Yu added a comment - Nice analysis. I like #2 better. We should always make test scenario deterministic.
        Hide
        stack added a comment -

        The lock issue happens in a bunch of tests IIRC. Nice digging N. Can #2 fail ever? If so, I like #1.

        Show
        stack added a comment - The lock issue happens in a bunch of tests IIRC. Nice digging N. Can #2 fail ever? If so, I like #1.
        Hide
        Nicolas Liochon added a comment -

        I've got a fix using a variation of #2. Tested 100 times without any failure.
        The advantage of #1 for me is that it eliminates a quite tricky behavior, but the fix would then be outside HBase...

        Show
        Nicolas Liochon added a comment - I've got a fix using a variation of #2. Tested 100 times without any failure. The advantage of #1 for me is that it eliminates a quite tricky behavior, but the fix would then be outside HBase...
        Hide
        Ted Yu added a comment -

        +1 on patch.
        Looped over the test and didn't encounter test failure.

        Show
        Ted Yu added a comment - +1 on patch. Looped over the test and didn't encounter test failure.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12510114/5163.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 javadoc. The javadoc tool appears to have generated -147 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.replication.TestReplication
        org.apache.hadoop.hbase.mapreduce.TestImportTsv
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/723//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/723//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/723//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12510114/5163.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -147 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplication org.apache.hadoop.hbase.mapreduce.TestImportTsv org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/723//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/723//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/723//console This message is automatically generated.
        Hide
        stack added a comment -

        This is kinda funny:

        -    Thread.sleep(10000);
        +
        

        given your comment above it "+ // This function is synchronous, when it returns the node is killed."

        Good stuff.

        (Woah, you get rid of two ten second pauses...)

        +1 on commit.

        Show
        stack added a comment - This is kinda funny: - Thread .sleep(10000); + given your comment above it "+ // This function is synchronous, when it returns the node is killed." Good stuff. (Woah, you get rid of two ten second pauses...) +1 on commit.
        Hide
        Ted Yu added a comment -

        Integrated to TRUNK.

        Thanks for the patch, N.

        Thanks for the review, Stack.

        Show
        Ted Yu added a comment - Integrated to TRUNK. Thanks for the patch, N. Thanks for the review, Stack.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #2618 (See https://builds.apache.org/job/HBase-TRUNK/2618/)
        HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.") (N Keywal)

        tedyu :
        Files :

        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK #2618 (See https://builds.apache.org/job/HBase-TRUNK/2618/ ) HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.") (N Keywal) tedyu : Files : /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK-security #73 (See https://builds.apache.org/job/HBase-TRUNK-security/73/)
        HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.") (N Keywal)

        tedyu :
        Files :

        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK-security #73 (See https://builds.apache.org/job/HBase-TRUNK-security/73/ ) HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.") (N Keywal) tedyu : Files : /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
        Hide
        Ted Yu added a comment -

        Patch I would integrate to 0.92

        Show
        Ted Yu added a comment - Patch I would integrate to 0.92
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12510307/5163-92.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/742//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12510307/5163-92.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/742//console This message is automatically generated.
        Hide
        Hudson added a comment -

        Integrated in HBase-0.92 #241 (See https://builds.apache.org/job/HBase-0.92/241/)
        HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA
        ("The directory is already locked.") (N Keywal)

        tedyu :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
        Show
        Hudson added a comment - Integrated in HBase-0.92 #241 (See https://builds.apache.org/job/HBase-0.92/241/ ) HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.") (N Keywal) tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.92-security #72 (See https://builds.apache.org/job/HBase-0.92-security/72/)
        HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA
        ("The directory is already locked.") (N Keywal)

        tedyu :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
        Show
        Hudson added a comment - Integrated in HBase-0.92-security #72 (See https://builds.apache.org/job/HBase-0.92-security/72/ ) HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.") (N Keywal) tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java

          People

          • Assignee:
            Nicolas Liochon
            Reporter:
            Nicolas Liochon
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development