1. HBase
  2. HBASE-5163

TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")


    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.92.0, 0.94.0
    • Component/s: test
    • Labels:
    • Environment:


    • Hadoop Flags:


      The stack is typically:

          <error message="Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked." type="java.io.IOException">java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
      	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
      	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
      	at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.&lt;init&gt;(DataNode.java:290)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
      	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
      	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
      	at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
              // ...

      It can be reproduced without parallelization or without executing the other tests in the class. It seems to fail about 5% of the time.

      This comes from the naming policy for the directories in MiniDFSCluster#startDataNode. It depends on the number of nodes currently in the cluster, and does not take into account previous starts/stops:

         for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
            if (manageDfsDirs) {
              File dir1 = new File(data_dir, "data"+(2*i+1));
              File dir2 = new File(data_dir, "data"+(2*i+2));
            // [...]

      This means that it if we want to stop/start a datanode, we should always stop the last one, if not the names will conflict. This test exhibits the behavior:

        public void testMiniDFSCluster_startDataNode() throws Exception {
          assertTrue( dfsCluster.getDataNodes().size() == 2 );
          // Works, as we kill the last datanode, we can now start a datanode
            .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
          // Fails, as it's not the last datanode, the directory will conflict on
          //  creation
          try {
              .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
            fail("There should be an exception because the directory already exists");
          } catch (IOException e) {
            assertTrue( e.getMessage().contains("The directory is already locked."));
            LOG.info("Expected (!) exception caught " + e.getMessage());
          // Works, as we kill the last datanode, we can now restart 2 datanodes
          // This makes us back with 2 nodes
            .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);

      And then this behavior is randomly triggered in testLogRollOnDatanodeDeath because when we do

      DatanodeInfo[] pipeline = getPipeline(log);
      assertTrue(pipeline.length == fs.getDefaultReplication());

      and then kill the datanodes in the pipeline, we will have:

      • most of the time: pipeline = 1 & 2, so after killing 1&2 we can start a new datanode that will reuse the available 2's directory.
      • sometimes: pipeline = 1 & 3. In this case,when we try to launch the new datanode, it fails because it wants to use the same directory as the still alive '2'.

      There are two ways of fixing the test:
      1) Fix the naming rule in MiniDFSCluster#startDataNode, for example to ensure that the directory names will not be reused. But I wonder if there is not a testCase somewhere (may be not in HBase) depending on this behavior.
      2) Kill explicitly the first and second datanode without using the pipeline to be sure that the names won't conflict.

      Feedback welcome and the choice to make here...

      1. 5163.patch
        7 kB
        Nicolas Liochon
      2. 5163-92.txt
        5 kB
        Ted Yu


        Nicolas Liochon created issue -
        Nicolas Liochon made changes -
        Field Original Value New Value
        Attachment 5163.patch [ 12510114 ]
        Nicolas Liochon made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Ted Yu made changes -
        Summary TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.") TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
        Ted Yu made changes -
        Attachment 5163-92.txt [ 12510307 ]
        Ted Yu made changes -
        Hadoop Flags Reviewed [ 10343 ]
        Fix Version/s 0.92.0 [ 12314223 ]
        Fix Version/s 0.94.0 [ 12316419 ]
        Affects Version/s 0.92.0 [ 12314223 ]
        Affects Version/s 0.94.0 [ 12316419 ]
        Ted Yu made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Lars Hofhansl made changes -
        Status Resolved [ 5 ] Closed [ 6 ]


          • Assignee:
            Nicolas Liochon
            Nicolas Liochon
          • Votes:
            0 Vote for this issue
            0 Start watching this issue


            • Created: