Details
- Type: Sub-task
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
The RPC layer should handle various communication abnormalities (e.g. connection timeout, server-aborted connection, and so on). Ideally, the corresponding exceptions should be raised and propagated through the handlers of the client pipeline.
Attachments
- HBASE-18078.000.patch (12 kB) - Xiaobing Zhou
- HBASE-18078.001.patch (16 kB) - Xiaobing Zhou
- HBASE-18078.002.patch (14 kB) - Xiaobing Zhou
- HBASE-18078.003.patch (32 kB) - Xiaobing Zhou
- HBASE-18078.004.patch (14 kB) - Xiaobing Zhou
- HBASE-18078.005.patch (27 kB) - Xiaobing Zhou
- HBASE-18078.006.patch (10 kB) - Xiaobing Zhou
- HBASE-18078.007.patch (12 kB) - Xiaobing Zhou
- HBASE-18078.008.patch (21 kB) - Xiaobing Zhou
Issue Links
- relates to: HBASE-18245 Handle failed server in RpcClient (Closed)
Activity
Posted the v0 patch to handle connection-related issues, such as timeouts.
The idea is to catch connection errors and propagate them up as ConnectionException: at the bottom of the stack, whatever goes wrong while establishing the connection (i.e. the Wangle pipeline) surfaces as a folly::AsyncSocketException, which is wrapped inside a ConnectionException.
Note that there is no way to set socket options on Wangle's ClientBootstrap, but fortunately ClientBootstrap::connect accepts a timeout of our own, which is enough for now.
See also ConnectionFactory::Connect:
```cpp
try {
  /* any connection error (e.g. timeout) will be folly::AsyncSocketException */
  auto pipeline = client->connect(
      SocketAddress(hostname, port, true),
      std::chrono::duration_cast<milliseconds>(connect_timeout_)).get();
  auto dispatcher = std::make_shared<ClientDispatcher>();
  dispatcher->setPipeline(pipeline);
  promise.setValue(dispatcher);
} catch (const folly::AsyncSocketException &e) {
  promise.setException(
      folly::make_exception_wrapper<hbase::ConnectionException>(
          folly::make_exception_wrapper<folly::AsyncSocketException>(e)));
}
```
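On the caller side, the future tied to that promise then completes with the wrapped error. A minimal sketch of how it can be observed (the variable names are illustrative, not from the patch):

```cpp
try {
  // dispatcher_future: the future obtained from the promise that Connect fulfills
  auto dispatcher = dispatcher_future.get();
  // ... use the dispatcher ...
} catch (const hbase::ConnectionException& e) {
  // connection-level failure: timeout, connection refused, reset, ...
}
```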
Coming next:
- apply the API changes (e.g. ConnectionFactory and ConnectionPool) in all related places.
- build a TestRpcServer to simulate various abnormal behaviors; it can be started from the samples in BootstrapTest.
- handle the other abnormal cases defined in folly::AsyncSocketException (see the sketch after this list):
```cpp
class AsyncSocketException : public std::runtime_error {
 public:
  enum AsyncSocketExceptionType {
    UNKNOWN = 0,
    NOT_OPEN = 1,
    ALREADY_OPEN = 2,
    TIMED_OUT = 3,
    END_OF_FILE = 4,
    INTERRUPTED = 5,
    BAD_ARGS = 6,
    CORRUPTED_DATA = 7,
    INTERNAL_ERROR = 8,
    NOT_SUPPORTED = 9,
    INVALID_STATE = 10,
    SSL_ERROR = 12,
    COULD_NOT_BIND = 13,
    SASL_HANDSHAKE_TIMEOUT = 14,
    NETWORK_ERROR = 15
  };
```
- add unit tests to cover them.
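A hedged sketch of how those exception types could be classified when deciding what to wrap as ConnectionException; the helper name and the chosen set of connection-level types are assumptions, not part of the patch:

```cpp
#include <folly/io/async/AsyncSocketException.h>

// Illustrative only: treat "peer gone / unreachable" errors as connection-level,
// everything else (bad arguments, internal errors) as client-side bugs.
bool IsConnectionLevelError(const folly::AsyncSocketException& e) {
  switch (e.getType()) {
    case folly::AsyncSocketException::TIMED_OUT:
    case folly::AsyncSocketException::NOT_OPEN:
    case folly::AsyncSocketException::END_OF_FILE:
    case folly::AsyncSocketException::NETWORK_ERROR:
      return true;
    default:
      return false;
  }
}
```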
- You are still calling Future::get() here in the ConnectionFactory::AsyncConnect, no?
```diff
+  auto pipeline = client->connect(
+      SocketAddress(hostname, port, true),
+      std::chrono::duration_cast<milliseconds>(connect_timeout_)).get();
```
- If this patch is not changing the blocking nature of our TCP connection establishment, maybe we should not introduce these methods, especially ConnectionPool::AsyncGetNewConnection which seems to be a copy of the other method.
Posted v3.
- fixed rebase conflicts
- added RPC test server skeleton
Enis Soztutar, you are right, we are not going to change the behavior of connection establishment. The initial attempt at AsyncConnect was to satisfy the connection pool and pass the existing unit tests. I will get back to removing the async variant later on. Thanks.
This ticket will not contain anything related to TestRpcServer, it will be in HBASE-18338.
Posted v5:
- added implementations of two functions (i.e. ping and addr) in test_rpc_service.proto
- rebased
- did some refactoring related to HBASE-18338
v6 will come with error implementation that triggers ConnectionException.
Posted v6:
- Catch AsyncSocketException in ConnectionFactory::Connect and rethrow it as ConnectionException, which propagates up through ConnectionPool::GetConnection and RpcClient::SendRequest; the caller of RpcClient ultimately sees the ConnectionException.
- While writing data down the pipeline, RpcConnection::SendRequest may also encounter an AsyncSocketException; similarly, it is propagated to the caller of RpcClient.
The patch considers two cases:
- AsyncSocketException raised during the initial connection establishment.
- ConnectionException (with AsyncSocketException as the cause) raised by the async request/response call after the corresponding connection is established.
v7:
- added RpcClient::GetFutureWithException
- fixed tests broken by v6
Extra tests come with next patch.
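A plausible shape for GetFutureWithException, assuming its only job is to hand the caller an already-failed future instead of throwing; the real signature is in the v7/v8 patches and may differ:

```cpp
folly::Future<std::unique_ptr<Response>> RpcClient::GetFutureWithException(
    const folly::exception_wrapper& ew) {
  folly::Promise<std::unique_ptr<Response>> promise;
  auto future = promise.getFuture();
  promise.setException(ew);  // the caller's .get()/.onError() sees the error
  return future;
}
```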
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 5s | |

| Subsystem | Report/Notes |
|---|---|
| JIRA Issue | |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12880339/HBASE-18078.007.patch |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/7917/console |
| Powered by | Apache Yetus 0.4.0 http://yetus.apache.org |

This message was automatically generated.
Thanks Xiaobing Zhou for the updated patch.
As we discussed offline, we are not going to do ConnectionRetryPolicy. Upper-level retrying from the Rpc Retrying Callers will work just fine for re-establishing the connection.
```diff
+ * TODO:
+ * This function should plug in ConnectionRetryPolicy to handle:
+
```
- We can do this as VLOG(1):
```diff
+  VLOG(3) << folly::sformat("RpcClient Exception: {}", ew.what());
```
- Do we need to catch the AsyncSocketException at this level as well:
```diff
+  return GetConnection(remote_id)
+      ->SendRequest(std::move(req))
+      .onError([&, this](const folly::exception_wrapper& ew) {
```
or do the lower layers already handle it in every case and rethrow it as ConnectionException?
Other than these, patch looks good. We need the unit tests for committing it.
Thanks for review. v8 is posted.
- removed ConnectionRetryPolicy comment
- used VLOG(3) for RpcClient Exception
In addition:
- added socketNotOpen function in test_rpc_service.proto
- added unit tests for socketNotOpen
In RpcClient::SendRequest, we need to handle two cases for ConnectionException:
- The initial connection establishment, i.e. GetConnection(remote_id), with AsyncSocketException as the cause.
- Writing the request down the pipeline, i.e. RpcConnection::SendRequest, again with AsyncSocketException as the cause.
The socketNotOpen test covers the GetConnection(remote_id) case. Another test is needed to cover RpcConnection::SendRequest.
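A hedged sketch of how SendRequest can cover both cases; the signature is simplified and the body is illustrative only, with GetConnection, RpcConnection::SendRequest, and GetFutureWithException being the names used in this issue:

```cpp
folly::Future<std::unique_ptr<Response>> RpcClient::SendRequest(
    std::shared_ptr<ConnectionId> remote_id, std::unique_ptr<Request> req) {
  try {
    // Case 1: establishing the connection may throw ConnectionException
    // (wrapping the AsyncSocketException from ConnectionFactory::Connect).
    return GetConnection(remote_id)
        ->SendRequest(std::move(req))
        // Case 2: writing down the pipeline may fail later; turn the error
        // into a failed future for the caller.
        .onError([this](const folly::exception_wrapper& ew) {
          VLOG(3) << folly::sformat("RpcClient Exception: {}", ew.what());
          return GetFutureWithException(ew);
        });
  } catch (const hbase::ConnectionException& e) {
    return GetFutureWithException(
        folly::make_exception_wrapper<hbase::ConnectionException>(e));
  }
}
```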
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 1m 41s | |

| Subsystem | Report/Notes |
|---|---|
| JIRA Issue | |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12880452/HBASE-18078.008.patch |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/7933/console |
| Powered by | Apache Yetus 0.4.0 http://yetus.apache.org |

This message was automatically generated.
Posted v9:
- added fault-injection infrastructure; see also RpcFaultInjector and RpcClientFaultInjector
- injected a fault that closes the pipeline in ClientDispatcher::operator() to simulate getting an AsyncSocketException after the connection is created and before the request is sent (a minimal sketch of such a hook follows this list)
- added a unit test
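A minimal sketch of such a hook; the actual RpcFaultInjector/RpcClientFaultInjector classes are in the v9 patch and may differ. Production code calls a no-op hook, and a test installs one that closes the pipeline at the chosen point:

```cpp
#include <functional>

// Illustrative only: a process-wide fault-injection hook.
class RpcFaultInjector {
 public:
  using Hook = std::function<void()>;

  static RpcFaultInjector& Instance() {
    static RpcFaultInjector injector;
    return injector;
  }

  // Called by ClientDispatcher::operator() right before writing a request;
  // a no-op unless a test has installed a hook.
  void MaybeInjectFault() {
    if (hook_) hook_();
  }

  void SetHook(Hook hook) { hook_ = std::move(hook); }

 private:
  Hook hook_;
};

// A test simulating "socket closed after connect, before send" would do
// something like:
//   RpcFaultInjector::Instance().SetHook([pipeline] { pipeline->close(); });
```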
An AsyncSocketException is expected; however, "Broken promise for type unique_ptr<Response>" is returned instead. I will debug into this.
Since closing the socket after getting the connection and before sending the request behaves unexpectedly (i.e. "Broken promise for type unique_ptr<Response>"), I've decided to move this case to a separate ticket.
v8 is the work to be committed, v9 is dropped.
I have committed the v8 version which already addresses the review comments. HBASE-18204 builds on top of this. Thanks Xiaobing Zhou for the patch.
Xiaobing Zhou, rpc-test.cc is failing with a broken promise after this. Do you mind taking a look? Thanks.
Great. I think we can detect the server going away via TCP connection closure. However, we should also have the RPC timeout as implemented in the Java client. I think wangle might already have something to build the timeouts on.
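For illustration only (not something this patch implements): folly futures already provide a timeout combinator that a client-side RPC timeout could be layered on. WithRpcTimeout and its parameters are hypothetical names:

```cpp
#include <chrono>
#include <folly/futures/Future.h>

// Hypothetical helper: fail the response future with folly::FutureTimeout
// if it does not complete within rpc_timeout.
template <typename Response>
folly::Future<Response> WithRpcTimeout(folly::Future<Response> response,
                                       std::chrono::milliseconds rpc_timeout) {
  return std::move(response).within(rpc_timeout);
}
```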