IMPALA-8322

Cancelling an in-progress IO can cause delays and/or RPC timeouts due to a tangle of locks


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 3.3.0
    • Fix Version/s: Impala 3.3.0
    • Component/s: Backend
    • Labels: ghx-label-2

    Description

      This has been seen multiple times when running s3 tests:

      query_test/test_join_queries.py:57: in test_basic_joins
          self.run_test_case('QueryTest/joins', new_vector)
      common/impala_test_suite.py:472: in run_test_case
          result = self.__execute_query(target_impalad_client, query, user=user)
      common/impala_test_suite.py:699: in __execute_query
          return impalad_client.execute(query, user=user)
      common/impala_connection.py:174: in execute
          return self.__beeswax_client.execute(sql_stmt, user=user)
      beeswax/impala_beeswax.py:183: in execute
          handle = self.__execute_query(query_string.strip(), user=user)
      beeswax/impala_beeswax.py:360: in __execute_query
          self.wait_for_finished(handle)
      beeswax/impala_beeswax.py:381: in wait_for_finished
          raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
      E   ImpalaBeeswaxException: ImpalaBeeswaxException:
      E    Query aborted:Sender 127.0.0.1 timed out waiting for receiver fragment instance: 6c40d992bb87af2f:0ce96e5d00000007, dest node: 4

      This is related to IMPALA-6818. On a bad run, there are various timeouts in the impalad logs:

      I0316 10:47:16.359313 20175 krpc-data-stream-mgr.cc:354] Sender 127.0.0.1 timed out waiting for receiver fragment instance: ef4a5dc32a6565bd:a8720b8500000007, dest node: 5
      I0316 10:47:16.359345 20175 rpcz_store.cc:265] Call impala.DataStreamService.TransmitData from 127.0.0.1:40030 (request call id 14881) took 120182ms. Request Metrics: {}
      I0316 10:47:16.359380 20175 krpc-data-stream-mgr.cc:354] Sender 127.0.0.1 timed out waiting for receiver fragment instance: d148d83e11a4603d:54dc35f700000004, dest node: 3
      I0316 10:47:16.359395 20175 rpcz_store.cc:265] Call impala.DataStreamService.TransmitData from 127.0.0.1:40030 (request call id 14880) took 123097ms. Request Metrics: {}
      ... various messages ...
      I0316 10:47:56.364990 20154 kudu-util.h:108] Cancel() RPC failed: Timed out: CancelQueryFInstances RPC to 127.0.0.1:27000 timed out after 10.000s (SENT)
      ... various messages ...
      W0316 10:48:15.056421 20150 rpcz_store.cc:251] Call impala.ControlService.CancelQueryFInstances from 127.0.0.1:40912 (request call id 202) took 48695ms (client timeout 10000).
      W0316 10:48:15.056473 20150 rpcz_store.cc:255] Trace:
      0316 10:47:26.361265 (+ 0us) impala-service-pool.cc:165] Inserting onto call queue
      0316 10:47:26.361285 (+ 20us) impala-service-pool.cc:245] Handling call
      0316 10:48:15.056398 (+48695113us) inbound_call.cc:162] Queueing success response
      Metrics: {}
      I0316 10:48:15.057087 20139 connection.cc:584] Got response to call id 202 after client already timed out or cancelled

      So far, this has only happened on S3. The system load at the time is not higher than normal; if anything, it is lower than normal.
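
      To make the suspected mechanism concrete, here is a minimal, self-contained C++ sketch (illustrative only; the class, lock, and function names are hypothetical and are not Impala's actual code): a reader thread holds a per-scan-range lock while blocked in a slow S3 read, so the cancellation handler cannot acquire the lock and mark the range cancelled until the read finishes, overrunning the caller's RPC timeout in the same way the CancelQueryFInstances call does in the logs above.

      #include <chrono>
      #include <iostream>
      #include <mutex>
      #include <thread>

      using namespace std::chrono_literals;

      struct ScanRange {
        std::mutex lock;          // hypothetical per-range lock, not an Impala name
        bool cancelled = false;
      };

      // Stands in for an in-progress IO: the lock is held for the entire
      // duration of a slow (multi-second) S3 read.
      void SlowRead(ScanRange* range) {
        std::lock_guard<std::mutex> l(range->lock);
        std::this_thread::sleep_for(3s);
      }

      // Stands in for the cancel RPC handler: it cannot mark the range
      // cancelled until the reader releases the lock, so it blocks far
      // longer than the client-side RPC timeout would allow.
      void CancelHandler(ScanRange* range) {
        auto start = std::chrono::steady_clock::now();
        {
          std::lock_guard<std::mutex> l(range->lock);
          range->cancelled = true;
        }
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
        std::cout << "cancel handler blocked for " << ms << "ms\n";
      }

      int main() {
        ScanRange range;
        std::thread reader(SlowRead, &range);
        std::this_thread::sleep_for(100ms);  // let the read acquire the lock first
        std::thread canceller(CancelHandler, &range);
        reader.join();
        canceller.join();
        return 0;
      }

      Built with g++ -std=c++14 -pthread, the handler reports roughly 2900 ms of blocking; the real CancelQueryFInstances call above was stuck for 48695 ms against a 10 s client timeout.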

      Attachments

        1. run_tests_swimlane.json.gz (84 kB, Joe McDonnell)
        2. fb5b9729-2d7a-4590-ea365b87-d2ead75e.dmp_dumped (2.32 MB, Joe McDonnell)


          People

            Assignee: Joe McDonnell (joemcdonnell)
            Reporter: Joe McDonnell (joemcdonnell)
            Votes: 0
            Watchers: 6
