Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Impala 4.1.0
    • Component/s: None
    • Labels: None

    Description

      The initial RPC to submit a query and fetch the query handle can take quite a long time to return, because planning and submission can involve catalog operations such as Rename or Alter Table Recover Partitions that are slow on tables with many partitions (https://github.com/apache/impala/blob/1231208da7104c832c13f272d1e5b8f554d29337/be/src/exec/catalog-op-executor.cc#L92). Attached is the profile of one such DDL query (with a few fields hidden).

      These RPCs are: 

      1. Beeswax:

      https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/service/impala-beeswax-server.cc#L57

      2. HS2:

      https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/service/impala-hs2-server.cc#L462

       

      One side effect of such a long-running RPC is that clients such as impala-shell connecting through an AWS NLB can get stuck forever. The NLB tracks idle connections and closes them after 350 seconds, and this timeout cannot be configured. After closing the connection, the NLB does not send a TCP RST to the client; only when the client tries to send data does the NLB issue a TCP RST to indicate the connection is no longer alive. Documentation is here: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout. Hence an impala-shell waiting for the RPC to return gets stuck indefinitely.

      Hence, we may need to evaluate techniques for these RPCs to return the query handle after:

      1. Creating the driver: https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/service/impala-server.cc#L1150
      2. Registering the query: https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/service/impala-server.cc#L1168

      and then execute the later parts of the RPC asynchronously in a different thread, without blocking the RPC. That way clients can obtain the query handle and poll it for state and results.
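The proposed flow above can be sketched as follows. This is a minimal illustration, not Impala's actual API: the names (`QueryHandle`, `SubmitQueryAsync`) are hypothetical. The RPC handler registers the query and returns a handle immediately, while the slow planning/catalog work runs on a background thread that the client polls.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <string>
#include <thread>

// Hypothetical sketch; names are illustrative, not Impala's real classes.
enum class QueryState { kPending, kRunning, kFinished };

struct QueryHandle {
  std::string id;
  std::atomic<QueryState> state{QueryState::kPending};
  std::thread exec_thread;  // runs the slow part of submission

  ~QueryHandle() {
    if (exec_thread.joinable()) exec_thread.join();
  }
};

// Returns as soon as the handle exists; the expensive catalog operation
// (e.g. ALTER TABLE ... RECOVER PARTITIONS) proceeds without blocking the RPC.
QueryHandle* SubmitQueryAsync(QueryHandle* handle) {
  handle->exec_thread = std::thread([handle] {
    handle->state = QueryState::kRunning;
    // Placeholder for the expensive planning/catalog work.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    handle->state = QueryState::kFinished;
  });
  return handle;  // the client can now poll the handle for state/results
}
```

A client would then loop on the handle's state (as HS2 clients do with GetOperationStatus) instead of blocking inside the submission RPC.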

       


        Activity

          amargoor Amogh Margoor added a comment -

          DOC Jira for the same: IMPALA-10812

          stigahuang Quanlong Huang added a comment -

          This seems to be a duplicate of IMPALA-2568.


          jira-bot ASF subversion and git services added a comment -

          Commit 975883c47035843398ee99a21fa132f67a0d4954 in impala's branch refs/heads/master from Qifan Chen
          [ https://gitbox.apache.org/repos/asf?p=impala.git;h=975883c ]

          IMPALA-10811 RPC to submit query getting stuck for AWS NLB forever

          This patch addresses Impala client hang due to AWS network load balancer
          timeout which is fixed at 350s. When some long DDL operations are
          executing and the timeout happens, AWS silently drops the connection and
          the Impala client enters the hang state.

          The fix maintains the current TCLIService protocol between the client
          and Impala server and is applicable to the following Impala clients
          which issue thrift RPC ExecuteStatement() followed by repeated call to
          GetOperationStatus() (HS2, Impyla and HUE) or a variant of it (Beeswax)
          to Impala backend.

          1. HS2
          2. Beeswax
          3. Impyla
          4. HUE

          In the fix, the backend method ClientRequestState::ExecDdlRequest()
          can start a new thread in 'async_exec_thread_' for ExecDdlRequestImpl()
          which executes most of the DDLs asynchronously. This thread is waited
          for in the wait thread 'wait_thread_'. Since the wait thread also runs
          asynchronously, the execution of the DDLs will not cause a wait on the
          Impala client. Thus the Impala client can keep checking its execution
          status via GetOperationStatus() without long waiting, say more than
          350s.

          As an optimization, the above asynchronous mode is not applied to the
          execution of certain DDLs that run very low risks of long execution.

          1. Operations that do not access catalog service;
          2. COMPUTE STATS as the stats computation queries already run
          asynchronously.

          External behavior change:
          1. A new field with name "DDL execution mode:" is added to the
          summary section in the runtime profile, next to "DDL Type". This
          field takes either 'asynchronous' or 'synchronous' as value.
          2. A new query option 'enable_async_ddl_execution', default to true,
          is added. It can be set to false to turn off the patch.

          Limitations:
          This patch does not handle potential AWS NLB-type time out for LOAD
          DATA (IMPALA-10967).

          Testing:
          1. Added new async. DDL unit tests with HS2, HS2-HTTP, Beeswax and
          JDBC clients.
          2. Ran core tests successfully.

          Change-Id: Ib57e86926a233ef13d27a9ec8d9c36d33a88a44e
          Reviewed-on: http://gerrit.cloudera.org:8080/17872
          Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
          Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
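The two-thread structure the commit describes can be sketched as below. This is a minimal sketch, not the real `ClientRequestState`: only the member names `async_exec_thread_` and `wait_thread_` mirror the commit message; everything else is illustrative. The DDL runs on one thread, a second thread waits on it, and neither blocks the RPC thread that serves status polls.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Illustrative sketch of the async-DDL structure from the commit message.
class ClientRequestStateSketch {
 public:
  void ExecDdlRequest(std::function<void()> ddl_work) {
    // Run the DDL off the RPC thread (cf. ExecDdlRequestImpl()).
    async_exec_thread_ = std::thread(std::move(ddl_work));
    // The wait thread joins the DDL thread, also off the RPC thread, so
    // GetOperationStatus()-style polls never block on the DDL itself.
    wait_thread_ = std::thread([this] {
      async_exec_thread_.join();
      done_ = true;
    });
  }

  bool done() const { return done_; }

  ~ClientRequestStateSketch() {
    if (wait_thread_.joinable()) wait_thread_.join();
  }

 private:
  std::thread async_exec_thread_;
  std::thread wait_thread_;
  std::atomic<bool> done_{false};
};
```

With this shape, a long DDL keeps running in the background while each status poll returns immediately, so the connection never sits idle past the NLB's 350-second limit.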

          sql_forever Qifan Chen added a comment -

          The major work was done in commit 975883c47035843398ee99a21fa132f67a0d4954.

          The remaining work on load data is separately tracked in IMPALA-10967.


          jira-bot ASF subversion and git services added a comment -

          Commit 8ddac48f3428c86f2cbd037ced89cfb903298b12 in impala's branch refs/heads/master from Joe McDonnell
          [ https://gitbox.apache.org/repos/asf?p=impala.git;h=8ddac48 ]

          IMPALA-10989: fix race for result set metadata

          TSAN tests uncovered a race condition between the
          thread reading the result set metadata in
          ImpalaServer::GetResultSetMetadata() and the
          thread setting the result set metadata in
          ClientRequestState::SetResultSet() from
          ClientRequestState::ExecDdlRequestImpl().
          This is introduced by IMPALA-10811, which runs
          ExecDdlRequestImpl in an async thread that
          can now race with the client thread.

          GetResultSetMetadata() holds ClientRequestState's
          lock_ while reading the result set metadata, so
          the fix is to hold this lock when writing the
          result set metadata.
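The fix described above follows a standard pattern: reader and writer of the shared metadata must take the same lock. A minimal sketch, assuming a hypothetical holder class (only the member name `lock_` mirrors the commit; the rest is illustrative):

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Illustrative sketch of the synchronization the commit adds.
class ResultMetadataHolder {
 public:
  // Writer side, called from the async DDL thread (cf. SetResultSet()).
  void SetMetadata(std::vector<std::string> cols) {
    std::lock_guard<std::mutex> l(lock_);  // the lock the fix adds on the write path
    columns_ = std::move(cols);
  }

  // Reader side, called from the client thread (cf. GetResultSetMetadata()),
  // which was already taking the lock before the fix.
  std::vector<std::string> GetMetadata() {
    std::lock_guard<std::mutex> l(lock_);
    return columns_;
  }

 private:
  std::mutex lock_;
  std::vector<std::string> columns_;
};
```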

          Testing:

          • Ran TSAN core job

          Change-Id: Ic0833ed20d62474c434fa94bbbf8cd8ea99a7cf4
          Reviewed-on: http://gerrit.cloudera.org:8080/18212
          Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
          Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>


          People

            Assignee: sql_forever Qifan Chen
            Reporter: amargoor Amogh Margoor
            Votes: 0
