Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4454

test_kudu.TestShowCreateTable fails on S3

    Details

      Description

      I suspect these tests should be disabled on the S3 build - but if so, the 'table already exists' messages are worth understanding.

      =================================== FAILURES ===================================
      ____________ TestShowCreateTable.test_primary_key_and_distribution _____________
      [gw2] linux2 -- Python 2.6.6 /data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/bin/../infra/python/env/bin/python
      query_test/test_kudu.py:258: in test_primary_key_and_distribution
          db=cursor.conn.db_name, kudu_addr=KUDU_MASTER_HOSTS))
      query_test/test_kudu.py:210: in assert_show_create_equals
          cursor.execute(create_sql.format(**format_args))
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:187: in execute
          configuration=configuration)
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:205: in execute_async
          self._execute_async(op)
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:220: in _execute_async
          operation_fn()
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:203: in op
          configuration)
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:557: in wrapper
          return func(*args, **kwargs)
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:642: in execute_statement
          err_if_rpc_not_ok(resp)
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:497: in err_if_rpc_not_ok
          raise HiveServer2Error(resp.status.errorMessage)
      E   HiveServer2Error: 
      E   ImpalaRuntimeException: Error creating Kudu table 'impala::d59w6c.mqwvehsdgo'
      E   CAUSED BY: ImpalaRuntimeException: Table 'impala::d59w6c.mqwvehsdgo' already exists in Kudu.
      ---------------------------- Captured stderr setup -----------------------------
      -- connecting to: localhost:21000
      MainThread: Connecting to HiveServer2 localhost:21050 with NOSASL authentication mechanism
      MainThread: Getting a cursor (Impala session)
      MainThread: Executing query CREATE DATABASE d59w6c
      MainThread: Waiting for query to finish
      MainThread: Query finished
      MainThread: Closing HiveServer2Cursor
      MainThread: Closing HS2 connection
      MainThread: Connecting to HiveServer2 localhost:21050 with NOSASL authentication mechanism
      MainThread: Getting a cursor (Impala session)
      MainThread: Using database d59w6c as default
      MainThread: Executing query USE d59w6c
      MainThread: Waiting for query to finish
      MainThread: Query finished
      ----------------------------- Captured stderr call -----------------------------
      MainThread: Executing query 
              CREATE TABLE tkrbgnieqs (c INT PRIMARY KEY)
              DISTRIBUTE BY HASH (c) INTO 3 BUCKETS STORED AS KUDU
      MainThread: Waiting for query to finish
      MainThread: Query finished
      MainThread: Executing query SHOW CREATE TABLE tkrbgnieqs
      MainThread: Waiting for query to finish
      MainThread: Query finished
      MainThread: Fetching all result rows
      MainThread: Executing query 
              CREATE TABLE dynqvtzcte (c INT PRIMARY KEY, d STRING)
              DISTRIBUTE BY HASH (c) INTO 3 BUCKETS, RANGE (c)
              (PARTITION VALUES <= 1, PARTITION 1 < VALUES <= 2,
               PARTITION 2 < VALUES) STORED AS KUDU
      MainThread: Waiting for query to finish
      MainThread: Query finished
      MainThread: Executing query SHOW CREATE TABLE dynqvtzcte
      MainThread: Waiting for query to finish
      MainThread: Query finished
      MainThread: Fetching all result rows
      MainThread: Executing query 
              CREATE TABLE mqwvehsdgo (c INT, PRIMARY KEY (c))
              DISTRIBUTE BY HASH (c) INTO 3 BUCKETS STORED AS KUDU
      MainThread: Failed to open transport (tries_left=3)
      Traceback (most recent call last):
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py", line 557, in wrapper
          return func(*args, **kwargs)
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py", line 640, in execute_statement
          resp = service.ExecuteStatement(req)
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 265, in ExecuteStatement
          return self.recv_ExecuteStatement()
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 276, in recv_ExecuteStatement
          (fname, mtype, rseqid) = self._iprot.readMessageBegin()
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
          sz = self.readI32()
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/thrift/protocol/TBinaryProtocol.py", line 206, in readI32
          buff = self.trans.readAll(4)
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/thrift/transport/TTransport.py", line 58, in readAll
          chunk = self.read(sz - have)
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/thrift/transport/TTransport.py", line 159, in read
          self.__rbuf = StringIO(self.__trans.read(max(sz, self.__rbuf_size)))
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/thrift/transport/TSocket.py", line 103, in read
          buff = self.handle.recv(sz)
      timeout: timed out
      _____________________ TestShowCreateTable.test_properties ______________________
      [gw2] linux2 -- Python 2.6.6 /data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/bin/../infra/python/env/bin/python
      query_test/test_kudu.py:311: in test_properties
          db=cursor.conn.db_name, kudu_addr=KUDU_MASTER_HOSTS, props=props))
      query_test/test_kudu.py:210: in assert_show_create_equals
          cursor.execute(create_sql.format(**format_args))
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:187: in execute
          configuration=configuration)
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:205: in execute_async
          self._execute_async(op)
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:220: in _execute_async
          operation_fn()
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:203: in op
          configuration)
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:557: in wrapper
          return func(*args, **kwargs)
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:642: in execute_statement
          err_if_rpc_not_ok(resp)
      ../infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py:497: in err_if_rpc_not_ok
          raise HiveServer2Error(resp.status.errorMessage)
      E   HiveServer2Error: 
      E   ImpalaRuntimeException: Error creating Kudu table 'jkhjcyqglv'
      E   CAUSED BY: ImpalaRuntimeException: Table 'jkhjcyqglv' already exists in Kudu.
      ---------------------------- Captured stderr setup -----------------------------
      MainThread: Getting a cursor (Impala session)
      MainThread: Using database d59w6c as default
      MainThread: Executing query USE d59w6c
      MainThread: Waiting for query to finish
      MainThread: Query finished
      ----------------------------- Captured stderr call -----------------------------
      MainThread: Executing query 
              CREATE TABLE ptrixkhewx (c INT PRIMARY KEY)
              DISTRIBUTE BY HASH (c) INTO 3 BUCKETS
              STORED AS KUDU
              TBLPROPERTIES ('kudu.table_name'='jkhjcyqglv')
      MainThread: Failed to open transport (tries_left=3)
      Traceback (most recent call last):
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py", line 557, in wrapper
          return func(*args, **kwargs)
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/impala/hiveserver2.py", line 640, in execute_statement
          resp = service.ExecuteStatement(req)
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 265, in ExecuteStatement
          return self.recv_ExecuteStatement()
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 276, in recv_ExecuteStatement
          (fname, mtype, rseqid) = self._iprot.readMessageBegin()
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
          sz = self.readI32()
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/thrift/protocol/TBinaryProtocol.py", line 206, in readI32
          buff = self.trans.readAll(4)
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/thrift/transport/TTransport.py", line 58, in readAll
          chunk = self.read(sz - have)
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/thrift/transport/TTransport.py", line 159, in read
          self.__rbuf = StringIO(self.__trans.read(max(sz, self.__rbuf_size)))
        File "/data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/infra/python/env/lib/python2.6/site-packages/thrift/transport/TSocket.py", line 103, in read
          buff = self.handle.recv(sz)
      timeout: timed out
      

        Activity

        Show
        henryr Henry Robinson added a comment - This was the failing build: http://sandbox.jenkins.cloudera.com/job/impala-asf-master-core-s3/126
        Hide
        mjacobs Matthew Jacobs added a comment -

        I think this is what's happening:

        These tests are using hs2 (hiveserver2.py) which has some code that sets a timeout on the socket, and that defaults to 45sec:

        def connect(host, port, timeout=45, use_ssl=False, ca_cert=None,
                    user=None, password=None, kerberos_service_name='impala',
                    auth_mechanism=None):
            log.info('Connecting to HiveServer2 %s:%s with %s authentication '
                     'mechanism', host, port, auth_mechanism)
            sock = get_socket(host, port, use_ssl, ca_cert)
            if six.PY2:
                sock.setTimeout(timeout * 1000.)
            elif six.PY3:
                sock.set_timeout(timeout * 1000.)
        

        Then, ExecuteStatement (and other calls) get wrapped in a @retry decorator which results in retrying the operation for some exceptions. In this case, I believe this operation is taking a really long time (maybe too much concurrency and we're doing something dumb while holding metastoreDdlLock_?), then the socket timeout closes the connection while the Catalog has actually created the kudu table and just not returned yet. Then the calling code retries and finds the table is already there. I do see that the 2nd call occurs 45sec after the first, so that fits.

        I don't know why the operation is taking so long, but I think that's a separate issue since we know that Kudu seems to slow down significantly under load (even metadata operations).

        I think this test code shouldn't be timing out after 45sec (we give Kudu several minutes for DDL ops now, we should probably do the same), and when it does time out DDL statements it shouldn't be retrying.

        Separately we need to get more information about why this is taking so long, but that can be done after fixing the test code to unblock builds.

        Show
        mjacobs Matthew Jacobs added a comment - I think this is what's happening: These tests are using hs2 (hiveserver2.py) which has some code that sets a timeout on the socket, and that defaults to 45sec: def connect(host, port, timeout=45, use_ssl=False, ca_cert=None, user=None, password=None, kerberos_service_name='impala', auth_mechanism=None): log.info('Connecting to HiveServer2 %s:%s with %s authentication ' 'mechanism', host, port, auth_mechanism) sock = get_socket(host, port, use_ssl, ca_cert) if six.PY2: sock.setTimeout(timeout * 1000.) elif six.PY3: sock.set_timeout(timeout * 1000.) Then, ExecuteStatement (and other calls) get wrapped in a @retry decorator which results in retrying the operation for some exceptions. In this case, I believe this operation is taking a really long time (maybe too much concurrency and we're doing something dumb while holding metastoreDdlLock_?), then the socket timeout closes the connection while the Catalog has actually created the kudu table and just not returned yet. Then the calling code retries and finds the table is already there. I do see that the 2nd call occurs 45sec after the first, so that fits. I don't know why the operation is taking so long, but I think that's a separate issue since we know that Kudu seems to slow down significantly under load (even metadata operations). I think this test code shouldn't be timing out after 45sec (we give Kudu several minutes for DDL ops now, we should probably do the same), and when it does time out DDL statements it shouldn't be retrying. Separately we need to get more information about why this is taking so long, but that can be done after fixing the test code to unblock builds.
        Hide
        twmarshall Thomas Tauber-Marshall added a comment -

        commit d15f86cb6fa5eb482b039f036dab3d979c7513c8
        Author: Thomas Tauber-Marshall <tmarshall@cloudera.com>
        Date: Thu Nov 10 15:13:56 2016 -0800

        IMPALA-4454: test_kudu.TestShowCreateTable flaky

        The cause of the flakiness is Kudu CREATE TABLE operations
        that are sometimes taking a long time, leading to timeouts
        in the hiveserver2 connection. This patch adds the ability
        for tests using the 'conn' pytest fixture to specify a
        timeout to connect(), and sets a timeout of 5 minutes for
        this test.

        Change-Id: I2727c27ff66140ac4043bcad332cd4e1d72b255f
        Reviewed-on: http://gerrit.cloudera.org:8080/5040
        Reviewed-by: Michael Brown <mikeb@cloudera.com>
        Reviewed-by: Alex Behm <alex.behm@cloudera.com>
        Tested-by: Internal Jenkins

        Show
        twmarshall Thomas Tauber-Marshall added a comment - commit d15f86cb6fa5eb482b039f036dab3d979c7513c8 Author: Thomas Tauber-Marshall <tmarshall@cloudera.com> Date: Thu Nov 10 15:13:56 2016 -0800 IMPALA-4454 : test_kudu.TestShowCreateTable flaky The cause of the flakiness is Kudu CREATE TABLE operations that are sometimes taking a long time, leading to timeouts in the hiveserver2 connection. This patch adds the ability for tests using the 'conn' pytest fixture to specify a timeout to connect(), and sets a timeout of 5 minutes for this test. Change-Id: I2727c27ff66140ac4043bcad332cd4e1d72b255f Reviewed-on: http://gerrit.cloudera.org:8080/5040 Reviewed-by: Michael Brown <mikeb@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins

          People

          • Assignee:
            twmarshall Thomas Tauber-Marshall
            Reporter:
            henryr Henry Robinson
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development