Apache Ozone / HDDS-5632

Intermittent failure in TestOzoneManagerBootstrap#testBootstrapTwoNewOMs

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed

    Description

      The stack trace is as follows:

      Error:  testBootstrapTwoNewOMs  Time elapsed: 66.255 s  <<< ERROR!
      java.io.IOException: Failed init RocksDB, db path : /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-585ca9ba-85a5-4b41-a979-d85c644b1560/omNode-bootstrap-1/om.db, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1629174403 acquiring thread 140196605994752: /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-585ca9ba-85a5-4b41-a979-d85c644b1560/omNode-bootstrap-1/om.db/LOCK: No locks available; status : IOError; message : lock hold by current process, acquire time 1629174403 acquiring thread 140196605994752: /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-585ca9ba-85a5-4b41-a979-d85c644b1560/omNode-bootstrap-1/om.db/LOCK: No locks available 
          at org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:564) 
          at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:164) 
          at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:191) 
          at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:397) 
          at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:387) 
          at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.start(OmMetadataManagerImpl.java:379) 
          at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:246) 
          at org.apache.hadoop.ozone.om.OzoneManager.instantiateServices(OzoneManager.java:581) 
          at org.apache.hadoop.ozone.om.OzoneManager.<init>(OzoneManager.java:505) 
          at org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:552) 
          at org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.bootstrapNewOM(MiniOzoneHAClusterImpl.java:791) 
          at org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.bootstrapOzoneManager(MiniOzoneHAClusterImpl.java:706) 
          at org.apache.hadoop.ozone.om.TestOzoneManagerBootstrap.testBootstrapOMs(TestOzoneManagerBootstrap.java:156) 
          at org.apache.hadoop.ozone.om.TestOzoneManagerBootstrap.testBootstrapTwoNewOMs(TestOzoneManagerBootstrap.java:180) 
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
          at java.lang.reflect.Method.invoke(Method.java:498) 
          at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) 
          at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) 
          at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) 
          at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) 
          at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
          at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:258) 
          at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288) 
          at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282) 
          at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
          at java.lang.Thread.run(Thread.java:748)
      Caused by: org.rocksdb.RocksDBException: lock hold by current process, acquire time 1629174403 acquiring thread 140196605994752: /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-585ca9ba-85a5-4b41-a979-d85c644b1560/omNode-bootstrap-1/om.db/LOCK: No locks available 
          at org.rocksdb.RocksDB.open(Native Method)
          at org.rocksdb.RocksDB.open(RocksDB.java:306)
          at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:119)
          ... 26 more

      Root cause: when MiniOzoneHAClusterImpl#bootstrapOzoneManager creates a new OM, it may hit a port conflict. The function then retries with a new port, but the metadataManager of the first OM never released its lock on the RocksDB, so the retry fails when it tries to open the same om.db.
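
      The failure mode can be reproduced in miniature without Ozone or RocksDB. The following is only an analogy, assuming a hypothetical Store class that takes a java.nio file lock on open, and a hypothetical start() that opens the store and then fails with a simulated BindException. None of these names come from the Ozone code base.

```java
import java.io.IOException;
import java.net.BindException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

class LockLeakDemo {

  // Pins leaked stores so GC cannot release their locks, much as a native
  // DB handle stays open until it is closed explicitly.
  static final List<Store> LEAKED = new ArrayList<>();

  /** Hypothetical stand-in for a store that locks a file when opened. */
  static final class Store implements AutoCloseable {
    private final FileChannel channel;
    private final FileLock lock;

    Store(Path lockFile) throws IOException {
      channel = FileChannel.open(lockFile,
          StandardOpenOption.CREATE, StandardOpenOption.WRITE);
      FileLock acquired;
      try {
        acquired = channel.tryLock();
      } catch (OverlappingFileLockException e) {
        // This JVM already holds the lock through another channel.
        channel.close();
        throw new IOException("lock held by current process: " + lockFile, e);
      }
      if (acquired == null) {
        channel.close();
        throw new IOException("lock held by another process: " + lockFile);
      }
      lock = acquired;
    }

    @Override
    public void close() throws IOException {
      lock.release();
      channel.close();
    }
  }

  /** Opens the store, then optionally fails as if a later RPC bind had failed. */
  static Store start(Path lockFile, boolean portConflict, boolean closeOnFailure)
      throws IOException {
    Store store = new Store(lockFile);
    if (!portConflict) {
      return store;
    }
    if (closeOnFailure) {
      store.close();       // release the lock before reporting the failure
    } else {
      LEAKED.add(store);   // lock stays held: the situation described above
    }
    throw new BindException("Address already in use");
  }

  /** Fails once with a simulated port conflict, then retries on the same path. */
  static boolean retrySucceeds(Path lockFile, boolean closeOnFailure)
      throws IOException {
    try {
      start(lockFile, true, closeOnFailure);
    } catch (BindException expected) {
      // fall through to the retry, as the test cluster does
    }
    try (Store retried = start(lockFile, false, closeOnFailure)) {
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("lock-demo");
    System.out.println("retry after leak:    " + retrySucceeds(dir.resolve("LOCK-a"), false));
    System.out.println("retry after cleanup: " + retrySucceeds(dir.resolve("LOCK-b"), true));
  }
}
```

      In this sketch the retry only succeeds when the failed attempt releases its lock first, which mirrors why the second createOm call cannot open om.db.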

      Options to solve:

      1. I tried adding a "metadataManager.stop()" call in the OM constructor for the case where the RPC server fails to start, but that triggers another error about the lock on the Ratis directory.
      2. I also tried stopping the ratisServer, but per https://github.com/apache/ratis/blob/dc0b68b4c0b8c187a08f669422a2cd099d7be0b7/ratis-common/src/main/java/org/apache/ratis/util/LifeCycle.java#L308, the close method is not called in this case, so the lock is not released. I then tried invoking closeMethod for State.NEW as well, but that led to other failures.
      3. So I think it is much easier to check whether the port is available in MiniOzoneHAClusterImpl before creating the OM.
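
      Option 3 could look roughly like the sketch below. It is not the actual patch: PortProbe, isPortFree and isPortRangeFree are hypothetical names, and the block size of 4 ports per OM is an assumption inferred from the "* 4" in the basePort computation.

```java
import java.io.IOException;
import java.net.ServerSocket;

class PortProbe {

  /** Returns true if a server socket can currently be bound to the port. */
  static boolean isPortFree(int port) {
    try (ServerSocket socket = new ServerSocket(port)) {
      return true;
    } catch (IOException e) {
      // BindException (a subclass of IOException) means the port is in use.
      return false;
    }
  }

  /** True only if all `count` consecutive ports starting at basePort are free. */
  static boolean isPortRangeFree(int basePort, int count) {
    for (int i = 0; i < count; i++) {
      if (!isPortFree(basePort + i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Walk candidate base ports in blocks of 4 until a free block is found.
    int basePort = 10000;
    while (basePort + 4 <= 65536 && !isPortRangeFree(basePort, 4)) {
      basePort += 4;
    }
    System.out.println("candidate base port: " + basePort);
  }
}
```

      A probe like this narrows the window but does not remove it: another process can still take the port between the check and the OM's own bind, so the existing BindException retry remains useful as a fallback.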

      Steps to reproduce:

      Change how basePort is generated as shown in the following diff. The error then occurs for omNode-bootstrap-2 in testBootstrapTwoNewOMs.

      @@ -697,9 +698,11 @@ public void bootstrapOzoneManager(String omNodeId) throws Exception {
       
           long leaderSnapshotIndex = getOMLeader().getRatisSnapshotIndex();
       
      +    int start = 0;
           while (true) {
             try {
      -        basePort = 10000 + RANDOM.nextInt(1000) * 4;
      +//        basePort = 10000 + RANDOM.nextInt(1000) * 4;
      +        basePort = 10000 + start * 4;
               OzoneConfiguration newConf = addNewOMToConfig(getOMServiceId(),
                   omNodeId, basePort);
       
      @@ -721,6 +724,7 @@ public void bootstrapOzoneManager(String omNodeId) throws Exception {
               if (e instanceof BindException ||
                   e.getCause() instanceof BindException) {
                 ++retryCount;
      +          start++;
                 LOG.info("MiniOzoneHACluster port conflicts, retried {} times",
                     retryCount);
               } else {
      

            People

              Symious Janus Chow
              Symious Janus Chow