Hadoop HDFS / HDFS-2531

TestDFSClientExcludedNodes and TestBlocksScheduledCounter can cause random failures in Eclipse.

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.24.0
    • Fix Version/s: None
    • Component/s: test
    • Labels:
      None

      Description

      FAILED: org.apache.hadoop.hdfs.TestDFSClientExcludedNodes.testExcludedNodes

      Error Message:
      Cannot lock storage /home/jenkins/jenkins-slave/workspace/Hadoop-Hdfs-trunk/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1. The directory is already locked.

      Stack Trace:
      java.io.IOException: Cannot lock storage /home/jenkins/jenkins-slave/workspace/Hadoop-Hdfs-trunk/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1. The directory is already locked.
      at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:586)
      at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:435)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:253)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:169)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:371)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:314)
      at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:298)
      at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:332)

      1. HDFS-2531.patch
        3 kB
        Uma Maheswara Rao G

        Activity

        Uma Maheswara Rao G added a comment -

        TestFileCreationNamenodeRestart is also failing with the same error in trunk.

        java.io.IOException: Cannot lock storage /home/jenkins/jenkins-slave/workspace/Hadoop-Hdfs-trunk/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2. The directory is already locked.
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:586)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:435)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:253)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:169)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:371)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:314)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:298)

        Uma Maheswara Rao G added a comment -

        This problem occurs when earlier tests still hold the locks on the storage directories.
        The failures look random.
        TestDFSClientExcludedNodes does not shut down the cluster, so the next test will definitely fail.

        To find the cause of the TestDFSClientExcludedNodes failure itself, we need to check the other tests for ones that do not shut down the cluster.
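
One way to narrow down such tests is a rough text scan of the test sources. The sketch below is only a heuristic and is not part of the attached patch: it assumes the tests create their cluster through Hadoop's MiniDFSCluster class and flags any .java file that mentions that class without a ".shutdown(" call anywhere in the file. The class name MissingShutdownScanner and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: lists test sources that may leave a cluster running. */
final class MissingShutdownScanner {

    /** Returns the .java files under root that mention MiniDFSCluster but never call shutdown(). */
    static List<Path> findSuspects(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> p.toString().endsWith(".java"))
                .filter(MissingShutdownScanner::looksLeaky)
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** Plain text check only; it cannot tell whether shutdown() is reached on every path. */
    private static boolean looksLeaky(Path file) {
        try {
            String src = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return src.contains("MiniDFSCluster") && !src.contains(".shutdown(");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Scan the directory given on the command line, or the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path suspect : findSuspects(root)) {
            System.out.println(suspect);
        }
    }
}
```

A file flagged this way still needs manual review, and a file that is not flagged may still skip shutdown() on some code path.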

        Uma Maheswara Rao G added a comment -

        After analysing the other tests, I found that TestBlocksScheduledCounter does not shut down the cluster. So TestDFSClientExcludedNodes might be the next test to run after it in Hudson.

        Todd Lipcon added a comment -

        I think this is failing because of TestDfsOverAvroRpc timing out, at least in recent builds I've seen.

        Uma Maheswara Rao G added a comment -

        Hi Todd, here is the latest report: https://builds.apache.org/job/Hadoop-Hdfs-trunk/lastCompletedBuild/testReport/

        It shows only two failures (TestDFSClientExcludedNodes and TestFileCreationNamenodeRestart).
        If TestDfsOverAvroRpc is timing out, it should be listed in the failures list, right?
        Surprisingly, I couldn't find TestDfsOverAvroRpc among the passed test cases either.

        Uma Maheswara Rao G added a comment -

        Updated the patch for trunk.
        These tests can cause random failures. The patch fixes the problem.

        Uma Maheswara Rao G added a comment -

        As Todd pointed out, the problem could be caused by TestDfsOverAvroRpc as well. Fixing the current problems can also avoid some random failures.

        Aaron T. Myers added a comment -

        Hi Uma, the patch looks fine, and is a good change to make, but I don't see how this addresses the cause of these failures. I believe the test cases are run in separate JVMs, which should result in all file locks being released before the next test is run.

        Were you able to reproduce these failures in your own environment? If so, how? And can you verify that this patch fixes the issue?

        Uma Maheswara Rao G added a comment -

        Yes Aaron, you are right. If we specify the fork option, each test is spawned in a separate JVM.
        As Todd pointed out, TestDfsOverAvroRpc is likely the real cause of the test failures. This change can still be considered as a way to ensure the correct pattern in tests, so that adding new tests does not require changing others.
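
The pattern in question can be sketched without Hadoop on the classpath. In the example below, LockingCluster is a hypothetical stand-in for a test cluster, not Hadoop code: it takes a java.nio file lock on a file in its storage directory (the name in_use.lock is illustrative) and reports a conflict with the same message seen in the stack traces above. Within a single JVM, a second lock attempt on the same file fails until the first lock is released, which is why a test that skips shutdown() can break the test that runs after it.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;

/** Hypothetical stand-in for a test cluster that locks its storage directory on start. */
final class LockingCluster {
    private final RandomAccessFile lockFile;
    private final FileLock lock;

    LockingCluster(File storageDir) throws IOException {
        lockFile = new RandomAccessFile(new File(storageDir, "in_use.lock"), "rw");
        FileLock acquired = null;
        try {
            acquired = lockFile.getChannel().tryLock();
        } catch (OverlappingFileLockException e) {
            // This JVM already holds the lock: an earlier cluster was never shut down.
        }
        if (acquired == null) {
            lockFile.close();
            throw new IOException("Cannot lock storage " + storageDir
                + ". The directory is already locked.");
        }
        lock = acquired;
    }

    /** Releases the lock so the next cluster can reuse the directory. */
    void shutdown() throws IOException {
        lock.release();
        lockFile.close();
    }
}

final class ShutdownPatternDemo {

    /** Starts a cluster, runs the test body, and always shuts the cluster down. */
    static void runWithCleanup(File dir, Runnable testBody) throws IOException {
        LockingCluster cluster = new LockingCluster(dir);
        try {
            testBody.run();
        } finally {
            cluster.shutdown();
        }
    }

    /** Returns true if a cluster can currently be started on dir. */
    static boolean canStart(File dir) {
        try {
            new LockingCluster(dir).shutdown();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("name1").toFile();

        // A test that forgets shutdown(): the lock outlives the test.
        LockingCluster leaked = new LockingCluster(dir);
        System.out.println("after leaky test, can start: " + canStart(dir));
        leaked.shutdown();

        // A test that shuts down in finally: the directory is free afterwards.
        runWithCleanup(dir, () -> { /* test body */ });
        System.out.println("after clean test, can start: " + canStart(dir));
    }
}
```

Run in one JVM, the first check should report false and the second true. With forked test JVMs the operating system drops the lock when each process exits, so the leak goes unnoticed there.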

        Uma Maheswara Rao G added a comment -

        The reason I recommend this change is that, when debugging the failures, I ran the JUnit tests in Eclipse on the hdfs package (where the failures come from). There I noticed that these two tests cause random failures. In Hudson, however, the cause of the failures would be the Avro test case.
        I did not notice TestDfsOverAvroRpc itself in the report because there is no timeout for this test. Todd has provided a patch for it in HDFS-2532.

        Aaron T. Myers added a comment -

        Could you please file a separate JIRA to fix these two tests, then?

        Uma Maheswara Rao G added a comment -

        Thanks a lot for taking a look!
        HDFS-2532 is the JIRA that is actually meant to fix the random failures in trunk.


          People

          • Assignee:
            Uma Maheswara Rao G
            Reporter:
            Uma Maheswara Rao G
          • Votes:
            0
            Watchers:
            2
