IMPALA-3299

Stress test failures: Couldn't open transport for impala-stress-cdh5-trunk2-5.vpc.cloudera.com:22000

    Details

      Description

      The stress test often prints many instances of the following error:

      11:34:00 Process Process-84:
      11:34:00 Traceback (most recent call last):
      11:34:00   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
      11:34:00     self.run()
      11:34:00   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
      11:34:00     self._target(*self._args, **self._kwargs)
      11:34:00   File "tests/stress/concurrent_select.py", line 613, in _start_single_runner
      11:34:00     raise Exception("Query failed: %s" % str(report.non_mem_limit_error))
      11:34:00 Exception: Query failed: 
      11:34:00 Couldn't get a client for impala-stress-cdh5-trunk2-5.vpc.cloudera.com:22000	Reason: Couldn't open transport for impala-stress-cdh5-trunk2-5.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
      

      e.g. http://sandbox.jenkins.cloudera.com/job/Impala-Stress-Test-EC2-CDH5-trunk/621/console

      Usually this will fail the job, but occasionally it will recover and keep going (although the error may show up again).

      It's hard to catch it exactly when this happens, but I've seen 40+ queries running on the impalads after this occurs.

      We need to investigate exactly what is causing this, and then decide what to do about it. This is currently failing a large proportion of stress jobs.

          Activity

          skye Skye Wanderman-Milne added a comment -

          We started seeing this because of the following change: https://github.com/cloudera/Impala/commit/13e3818a4577437a3e197f4a5f482dac350bf120#diff-517419b415e2dece48019013243d50faR605. Before that change we ignored queries that failed with a "Connection timed out" error; now we no longer ignore them if the message also contains "connect()", which these do.

          The "Couldn't get a client for ..." message comes from FragmentExecState::ReportStatusCb(). The query fails with this message when a backend can't connect to the coordinator to update its status. It's unclear why we're seeing so many of these, and especially why we're seeing them on the EC2 stress cluster and not the physical cluster. It could be because the EC2 machines are smaller and become too overloaded to make the connection before the timeout expires. Unfortunately there are so many other problems starting an EC2 stress cluster that I'm having a hard time reproducing this.
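
          For illustration only, the filtering behavior described above can be sketched roughly as follows. This is not the code from tests/stress/concurrent_select.py; the function name is_ignorable_timeout is hypothetical, and the only details taken from this issue are the two substrings being matched ("Connection timed out" and "connect()").

```python
def is_ignorable_timeout(error_message):
    """Hypothetical helper, not the actual stress test code.

    Returns True if a failed query's error should be ignored under the
    behavior described in this issue: "Connection timed out" errors are
    ignored unless the message also contains "connect()".
    """
    if "Connection timed out" not in error_message:
        # Not a timeout error at all, so it is never ignored by this rule.
        return False
    # Timeouts raised while opening a connection mention "connect()" and
    # are treated as real failures rather than ignored.
    return "connect()" not in error_message


# The error text reported in this issue contains both substrings,
# so under this rule it is counted as a query failure.
reported_error = (
    "Couldn't open transport for "
    "impala-stress-cdh5-trunk2-5.vpc.cloudera.com:22000 "
    "(connect() failed: Connection timed out)"
)
fails_the_job = not is_ignorable_timeout(reported_error)
```

          Under this assumed rule, the error in the description is no longer ignored, which would match the jobs starting to fail after the change.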

          dhecht Dan Hecht added a comment -

          Note also that before this change:

          commit 3bb07af695cdf64cfeb9e82b23f1efb52949af49
          Author: Sailesh Mukil <sailesh@cloudera.com>
          Date:   Thu Jan 28 23:39:57 2016 -0800
          

          we didn't report the reason, so that will also be a difference comparing old to new errors.

          henryr Henry Robinson added a comment -

          Tim Armstrong - have you seen any errors like this in recent stress-test runs?

          tarmstrong Tim Armstrong added a comment -

          I found a couple of instances of something seemingly related.

          http://sandbox.jenkins.cloudera.com/view/Impala/view/Stress/job/Impala-Stress-Test-Physical/657/console

          17:20:53 17:20:53 12693 140458147690240 ERROR:hiveserver2[560]:Failed to open transport (tries_left=3)
          17:20:53 Traceback (most recent call last):
          17:20:53   File "/var/lib/jenkins/workspace/Impala-Stress-Test-Physical/Impala/infra/python/env/local/lib/python2.7/site-packages/impala/hiveserver2.py", line 557, in wrapper
          17:20:53     return func(*args, **kwargs)
          17:20:53   File "/var/lib/jenkins/workspace/Impala-Stress-Test-Physical/Impala/infra/python/env/local/lib/python2.7/site-packages/impala/hiveserver2.py", line 695, in fetch_results
          17:20:53     resp = service.FetchResults(req)
          17:20:53   File "/var/lib/jenkins/workspace/Impala-Stress-Test-Physical/Impala/infra/python/env/local/lib/python2.7/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 625, in FetchResults
          17:20:53     return self.recv_FetchResults()
          17:20:53   File "/var/lib/jenkins/workspace/Impala-Stress-Test-Physical/Impala/infra/python/env/local/lib/python2.7/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 636, in recv_FetchResults
          17:20:53     (fname, mtype, rseqid) = self._iprot.readMessageBegin()
          17:20:53   File "/var/lib/jenkins/workspace/Impala-Stress-Test-Physical/Impala/thirdparty/hive-1.1.0-cdh5.9.0-SNAPSHOT/lib/py/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
          17:20:53     sz = self.readI32()
          17:20:53   File "/var/lib/jenkins/workspace/Impala-Stress-Test-Physical/Impala/thirdparty/hive-1.1.0-cdh5.9.0-SNAPSHOT/lib/py/thrift/protocol/TBinaryProtocol.py", line 203, in readI32
          17:20:53     buff = self.trans.readAll(4)
          17:20:53   File "/var/lib/jenkins/workspace/Impala-Stress-Test-Physical/Impala/thirdparty/hive-1.1.0-cdh5.9.0-SNAPSHOT/lib/py/thrift/transport/TTransport.py", line 58, in readAll
          17:20:53     chunk = self.read(sz-have)
          17:20:53   File "/var/lib/jenkins/workspace/Impala-Stress-Test-Physical/Impala/thirdparty/hive-1.1.0-cdh5.9.0-SNAPSHOT/lib/py/thrift/transport/TTransport.py", line 155, in read
          17:20:53     self.__rbuf = StringIO(self.__trans.read(max(sz, self.DEFAULT_BUFFER)))
          17:20:53   File "/var/lib/jenkins/workspace/Impala-Stress-Test-Physical/Impala/thirdparty/hive-1.1.0-cdh5.9.0-SNAPSHOT/lib/py/thrift/transport/TSocket.py", line 92, in read
          17:20:53     buff = self.handle.recv(sz)
          17:20:53 error: [Errno 11] Resource temporarily unavailable
          
          henryr Henry Robinson added a comment -

          Guessing this is due to IMPALA-4135.

          henryr Henry Robinson added a comment -

          I think this is IMPALA-4135. Checked the stress tests and I don't think we've seen it crop up since IMPALA-4135 was committed.


            People

            • Assignee:
              henryr Henry Robinson
              Reporter:
              skye Skye Wanderman-Milne