The stress test often prints many instances of the following error:
11:34:00 Process Process-84:
11:34:00 Traceback (most recent call last):
11:34:00 File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
11:34:00 File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
11:34:00 self._target(*self._args, **self._kwargs)
11:34:00 File "tests/stress/concurrent_select.py", line 613, in _start_single_runner
11:34:00 raise Exception("Query failed: %s" % str(report.non_mem_limit_error))
11:34:00 Exception: Query failed:
11:34:00 Couldn't get a client for impala-stress-cdh5-trunk2-5.vpc.cloudera.com:22000 Reason: Couldn't open transport for impala-stress-cdh5-trunk2-5.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
Usually this fails the job, but occasionally the test recovers and keeps going (although the error may show up again).
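To get a feel for how often this happens and which hosts are affected, something like the following could be run over a saved copy of the job's console output. This is only a rough sketch: the regex is based solely on the message format in the trace above, and `count_connect_failures` is a hypothetical helper, not part of the stress test.

```python
import re
from collections import Counter

# Pattern derived from the single error line in the trace above; other
# variants of the message (if any exist) would not be matched.
CONNECT_FAILURE = re.compile(
    r"Couldn't open transport for (?P<host>[^\s:]+):(?P<port>\d+) "
    r"\(connect\(\) failed: (?P<reason>[^)]*)\)"
)


def count_connect_failures(log_lines):
    """Count transport failures per (host, port, reason) in an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = CONNECT_FAILURE.search(line)
        if match:
            key = (match.group("host"), int(match.group("port")), match.group("reason"))
            counts[key] += 1
    return counts
```

Passing an open file object (for example `count_connect_failures(open(path))`, where `path` is wherever the console log was saved) returns a `Counter` keyed by host, port, and failure reason, which should show whether the timeouts cluster on one impalad or are spread across the cluster.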
It's hard to catch the cluster at the exact moment this happens, but I've seen 40+ queries running on the impalads after it occurs.
We need to investigate exactly what is causing this, and then decide what to do about it. This is currently failing a large proportion of stress jobs.
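As a possible starting point for the investigation, a small probe like the one below could be run from another cluster node while the stress test is active, to see whether plain TCP connects to the port in the trace also time out. This is a sketch under assumptions: `probe_port` is a hypothetical helper, the 5 second timeout is arbitrary, and a successful connect only shows that the socket was accepted, not that the impalad is healthy.

```python
import socket
import time


def probe_port(host, port, timeout_s=5.0):
    """Try one TCP connect and return (ok, elapsed_seconds, error_message)."""
    start = time.time()
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except (socket.error, socket.timeout) as e:
        # Covers refused connections, timeouts, and name resolution failures.
        return False, time.time() - start, str(e)
    sock.close()
    return True, time.time() - start, None
```

Calling `probe_port("impala-stress-cdh5-trunk2-5.vpc.cloudera.com", 22000)` in a loop and logging the elapsed time would show whether connects slow down or fail as the number of running queries grows.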