HIVE-1829

Fix intermittent failures in TestRemoteMetaStore

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: Metastore
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Notice how "Running metastore!" appears twice.

      test:
          [junit] Running org.apache.hadoop.hive.metastore.TestEmbeddedHiveMetaStore
          [junit] BR.recoverFromMismatchedToken
          [junit] Tests run: 11, Failures: 0, Errors: 0, Time elapsed: 36.697 sec
          [junit] Running org.apache.hadoop.hive.metastore.TestRemoteHiveMetaStore
          [junit] Running metastore!
          [junit] Running metastore!
          [junit] org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:29083.
          [junit] 	at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:98)
          [junit] 	at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:79)
          [junit] 	at org.apache.hadoop.hive.metastore.TServerSocketKeepAlive.<init>(TServerSocketKeepAlive.java:34)
          [junit] 	at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:2189)
          [junit] 	at org.apache.hadoop.hive.metastore.TestRemoteHiveMetaStore$RunMS.run(TestRemoteHiveMetaStore.java:35)
          [junit] 	at java.lang.Thread.run(Thread.java:619)
          [junit] Running org.apache.hadoop.hive.metastore.TestRemoteHiveMetaStore
          [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
          [junit] Test org.apache.hadoop.hive.metastore.TestRemoteHiveMetaStore FAILED (crashed)
      

        Activity

        John Sichi added a comment -

        Committed. Thanks Carl!

        John Sichi added a comment -

        +1. Will commit when tests pass.

        Carl Steinbach added a comment -

        Review request: https://reviews.apache.org/r/260/

        Carl Steinbach added a comment -

        This patch attempts to fix the intermittent failures in TestRemoteHiveMetaStore by instituting a 60-second
        wait between consecutive connection attempts in HiveMetaStoreClient. This wait period is configurable
        via the new configuration property hive.metastore.client.connect.retry.delay. The patch also defines the
        new configuration property hive.metastore.client.socket.timeout, which sets the timeout value on the
        Thrift socket wrapped by HiveMetaStoreClient.
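
The retry behaviour described above can be illustrated with a small, generic helper. This is only a sketch of the idea (retry a connection attempt, pausing a configurable time between failures), not the code from the patch: the class name RetryingConnect, the method connectWithRetry, and the maxAttempts parameter are hypothetical, and the real HiveMetaStoreClient reads its delay from hive.metastore.client.connect.retry.delay rather than taking it as an argument.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Hive: retries a connection attempt with a
// fixed delay between failures.
final class RetryingConnect {

    /**
     * Calls connect up to maxAttempts times, sleeping delayMillis after each
     * failed attempt except the last. Returns the first successful result,
     * or rethrows the exception from the final attempt.
     */
    static <T> T connectWithRetry(Callable<T> connect, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Give the server time to come up before trying again.
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }
}
```

In a test, a call such as connectWithRetry(() -> new HiveMetaStoreClient(hiveConf), 5, 1000L) would tolerate a server that needs a few extra seconds to start listening; the attempt count and delay shown here are assumed values, not the patch's defaults.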

        Edward Capriolo added a comment -

        Right, I noticed that the isServerRunning check is failing. I do not think it is limited to 'slow machines'; this fails on server-class hardware as well as on my laptop with 4 GB of RAM and a quad-core CPU.

        Carl Steinbach added a comment -

        Ashutosh's hunch is right – this doesn't have anything to do with multiple IPs.

        The problem is caused by the setUp method in TestRemoteHiveMetaStore:

          protected void setUp() throws Exception {
            super.setUp();
            if(isServerRunning) {
              return;
            }
            Thread t = new Thread(new RunMS());
            t.start();
        
            // Wait a little bit for the metastore to start. Should probably have
            // a better way of detecting if the metastore has started?
            Thread.sleep(5000);
        
            // ...
        
            client = new HiveMetaStoreClient(hiveConf);
            isThriftClient = true;
        
            // Now you have the client - run necessary tests.
            isServerRunning = true;
          }
        

        JUnit calls this method once before running each testcase, and if setUp() throws an
        exception it fails the current test before continuing on to the next one.

        On slow machines it can take longer than 5 seconds for the MetaStore server thread
        (RunMS) to initialize and open a listening socket. When this happens the
        HiveMetaStoreClient constructor fails with an exception that causes setUp() to exit before
        setting isServerRunning to true. JUnit then fails the current testcase and immediately begins
        initializing the next testcase, which results in another call to setUp(). Since isServerRunning
        is still false, we end up starting another MetaStore server thread, which attempts to open a
        listening socket on the same port as the first thread. That is what causes the
        TTransportException that you see in the log above.
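
The comment in setUp() asks for a better way of detecting that the metastore has started. One possible approach, sketched here under assumptions and not taken from the patch, is to poll the server port until a TCP connection succeeds instead of sleeping for a fixed 5 seconds. The class PortWaiter and the method waitForPort are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

// Hypothetical test utility, not part of Hive: waits until a TCP port accepts
// connections.
final class PortWaiter {

    /**
     * Repeatedly tries to connect to host:port. Returns true as soon as a
     * connection succeeds, or false if timeoutMillis elapses first.
     */
    static boolean waitForPort(String host, int port, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            try (Socket socket = new Socket()) {
                // A connect timeout of 0 would block indefinitely, so use at least 1 ms.
                socket.connect(new InetSocketAddress(host, port), (int) Math.max(1L, pollMillis));
                return true;
            } catch (IOException e) {
                // Nothing is listening yet; fall through and retry.
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(pollMillis);
        }
    }
}
```

With a helper like this, setUp() could call something along the lines of waitForPort("localhost", 29083, 60000L, 250L) after t.start() and fail with a clear message if it returns false. The port is the one shown in the log above; the timeout and poll interval are assumed values.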

        Edward Capriolo added a comment -

        Thank you for confirming. It is good to know that I am not going crazy. I have two separate machines running into this issue. I noticed this test "disappeared" from trunk, so I assume someone else is having problems with it as well.

        Ashutosh Chauhan added a comment -

        This issue can now also be seen in the latest builds on the Apache machines: https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/465/console

        Ashutosh Chauhan added a comment -

        I see similar errors at random. Whenever I do, making the following change always gets me past them:

        TestRemoteHiveMetaStore.java line number 51:
        -Thread.sleep(5000);
        +Thread.sleep(20000);

        So I am not sure it has anything to do with the machine having multiple IPs. I also see "Running metastore!" printed 2-7 times when there is a random failure.


          People

          • Assignee:
            Carl Steinbach
            Reporter:
            Edward Capriolo