[DRILL-5599] Notify StatusHandlerListener that batch sending has failed even if channel is still open - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.11.0
Fix Version/s: 1.11.0
Component/s: None
Labels:
- ready-to-commit

Description

Issue
Queries stay in CANCELLATION_REQUESTED state after connection with client was interrupted. Jstack shows that threads for such queries are blocked and waiting to semaphore to be released.

"26b70318-ddde-9ead-eee2-0828da97b59f:frag:0:0" daemon prio=10 tid=0x00007f56dc3c9000 nid=0x25fd waiting on condition [0x00007f56b31dc000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000006f4688ab0> (a java.util.concurrent.Semaphore$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
	at java.util.concurrent.Semaphore.acquire(Semaphore.java:472)
	at org.apache.drill.exec.ops.SendingAccountor.waitForSendComplete(SendingAccountor.java:48)
	- locked <0x00000006f4688a78> (a org.apache.drill.exec.ops.SendingAccountor)
	at org.apache.drill.exec.ops.FragmentContext.waitForSendComplete(FragmentContext.java:486)
	at org.apache.drill.exec.physical.impl.BaseRootExec.close(BaseRootExec.java:134)
	at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.close(ScreenCreator.java:141)
	at org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources(FragmentExecutor.java:313)
	at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:155)
	at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:264)
	at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
   Locked ownable synchronizers:	- <0x000000073f800b68> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Reproduce
Ran modified ConcurrencyTest.java referenced in ~~DRILL-4338~~ and cancel after 2-3 seconds. ConcurrencyTest.java should be modified as follows:
ExecutorService executor = Executors.newFixedThreadPool(10); and execute 200 queries for (int i = 1; i <= 200; i++).
Query: select * from dfs.`sample.json`, data set is attached.

Problem description
Looks like the problem occurs when the server has sent data to the client and waiting from the client confirmation that data was received. In this case ChannelListenerWithCoordinationId is used for tracking. ChannelListenerWithCoordinationId contains StatusHandler which keeps track of sent batches. It updates SendingAccountor with information about how many batches were sent and how many batches have reached the client (successfully or not).

When sent operation is complete (successfully or not) operationComplete(ChannelFuture future) is called. Given future contains information if sent operation was successful or not, failure cause, channel status etc. If sent operation was successful we do nothing since in this case client sent us acknowledgment and when we received it, we notified StatusHandlerListener that batch was received. But if sent operation has failed, we need to notify StatusHandler that sent was unsuccessful.

operationComplete(ChannelFuture future) code:

      if (!future.isSuccess()) {
        removeFromMap(coordinationId);
        if (future.channel().isActive()) {
          throw new RpcException("Future failed");
        } else {
          setException(new ChannelClosedException());
        }
      }
    }

Method setException notifies StatusHandler that batch sent has failed but it's only called when channel is closed. When channel is still open we just throw RpcException. This is where the problem occurs. operationComplete(ChannelFuture future) is called via Netty DefaultPromise.notifyListener0 method which catches Throwable and just logs it. So even of we throw exception nobody is notified about it, especially StatusHandler.

Fix
Use setException even if channel is still open instead of throwing exception.

This problem was also raised in PR-463 but was decided to be fixed in the scope of new Jira.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

sample.json
20/Jun/17 11:36
19 kB
Arina Ielchiieva

Issue Links

is duplicated by

DRILL-3665 Deadlock while executing CTAS that runs out of memory

Resolved

relates to

DRILL-4296 Query hangs in CANCELLATION_REQUESTED when cancelled after it starts returning results

Resolved

DRILL-3241 Query with window function runs out of direct memory and does not report back to client that it did

Closed

DRILL-3119 Query stays in "CANCELLATION_REQUESTED" status in UI after OOM of Direct buffer memory

Closed

DRILL-4338 Concurrent query remains in CANCELLATION_REQUESTED state

Closed

DRILL-4595 FragmentExecutor.fail() should interrupt the fragment thread to avoid possible query hangs

Closed

links to

GitHub Pull Request #857

(1 relates to, 1 links to)

Activity

Ascending order - Click to sort in descending order

ASF GitHub Bot added a comment - 20/Jun/17 11:42

GitHub user arina-ielchiieva opened a pull request:

https://github.com/apache/drill/pull/857

~~DRILL-5599~~: Notify StatusHandlerListener that batch sending has failed even if channel is still open

Details in ~~DRILL-5599~~(https://issues.apache.org/jira/browse/DRILL-5599)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/arina-ielchiieva/drill ~~DRILL-5599~~

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/857.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #857

commit 4cf04b6476de899bea6e6cfab98f3e8e2aaaa7c6
Author: Arina Ielchiieva <arina.yelchiyeva@gmail.com>
Date: 2017-06-20T09:18:27Z

~~DRILL-5599~~: Notify StatusHandlerListener that batch sending has failed even if channel is still open

ASF GitHub Bot added a comment - 20/Jun/17 11:42 GitHub user arina-ielchiieva opened a pull request: https://github.com/apache/drill/pull/857 DRILL-5599 : Notify StatusHandlerListener that batch sending has failed even if channel is still open Details in DRILL-5599 ( https://issues.apache.org/jira/browse/DRILL-5599 ) You can merge this pull request into a Git repository by running: $ git pull https://github.com/arina-ielchiieva/drill DRILL-5599 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/857.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #857 commit 4cf04b6476de899bea6e6cfab98f3e8e2aaaa7c6 Author: Arina Ielchiieva <arina.yelchiyeva@gmail.com> Date: 2017-06-20T09:18:27Z DRILL-5599 : Notify StatusHandlerListener that batch sending has failed even if channel is still open

ASF GitHub Bot added a comment - 20/Jun/17 23:07

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/857#discussion_r123118764

— Diff: exec/rpc/src/main/java/org/apache/drill/exec/rpc/RequestIdMap.java —
@@ -111,13 +111,16 @@ public RpcListener(RpcOutcomeListener<T> handler, Class<T> clazz, int coordinati

@Override
public void operationComplete(ChannelFuture future) throws Exception {
-
if (!future.isSuccess()) {

removeFromMap(coordinationId);
if (future.channel().isActive()) { - throw new RpcException("Future failed"); - }
else {
setException(new ChannelClosedException());
+ try { + removeFromMap(coordinationId); + }
finally {
+ final Throwable cause = future.cause();
+ if (future.channel().isActive()) {
+ setException(cause == null ? new RpcException("Future has failed") : cause);
- - End diff –

overall, LGTM. Some minor comments.

ASF GitHub Bot added a comment - 20/Jun/17 23:07 Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/857#discussion_r123118764 — Diff: exec/rpc/src/main/java/org/apache/drill/exec/rpc/RequestIdMap.java — @@ -111,13 +111,16 @@ public RpcListener(RpcOutcomeListener<T> handler, Class<T> clazz, int coordinati @Override public void operationComplete(ChannelFuture future) throws Exception { - if (!future.isSuccess()) { removeFromMap(coordinationId); if (future.channel().isActive()) { - throw new RpcException("Future failed"); - } else { setException(new ChannelClosedException()); + try { + removeFromMap(coordinationId); + } finally { + final Throwable cause = future.cause(); + if (future.channel().isActive()) { + setException(cause == null ? new RpcException("Future has failed") : cause); End diff – overall, LGTM. Some minor comments.

ASF GitHub Bot added a comment - 20/Jun/17 23:07

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/857#discussion_r123118252

— Diff: exec/rpc/src/main/java/org/apache/drill/exec/rpc/RequestIdMap.java —
@@ -111,13 +111,16 @@ public RpcListener(RpcOutcomeListener<T> handler, Class<T> clazz, int coordinati

@Override
public void operationComplete(ChannelFuture future) throws Exception {
-
if (!future.isSuccess()) {

removeFromMap(coordinationId);
if (future.channel().isActive()) { - throw new RpcException("Future failed"); - }
else {
setException(new ChannelClosedException());
+ try { + removeFromMap(coordinationId); + }
finally {
+ final Throwable cause = future.cause();
+ if (future.channel().isActive()) {
+ setException(cause == null ? new RpcException("Future has failed") : cause);
- - End diff –

Can we pick a better message ? "Future has failed" does not convey anything useful to the end user.

ASF GitHub Bot added a comment - 20/Jun/17 23:07 Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/857#discussion_r123118252 — Diff: exec/rpc/src/main/java/org/apache/drill/exec/rpc/RequestIdMap.java — @@ -111,13 +111,16 @@ public RpcListener(RpcOutcomeListener<T> handler, Class<T> clazz, int coordinati @Override public void operationComplete(ChannelFuture future) throws Exception { - if (!future.isSuccess()) { removeFromMap(coordinationId); if (future.channel().isActive()) { - throw new RpcException("Future failed"); - } else { setException(new ChannelClosedException()); + try { + removeFromMap(coordinationId); + } finally { + final Throwable cause = future.cause(); + if (future.channel().isActive()) { + setException(cause == null ? new RpcException("Future has failed") : cause); End diff – Can we pick a better message ? "Future has failed" does not convey anything useful to the end user.

ASF GitHub Bot added a comment - 20/Jun/17 23:07

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/857#discussion_r123118364

— Diff: exec/rpc/src/main/java/org/apache/drill/exec/rpc/RequestIdMap.java —
@@ -47,7 +47,7 @@
private final IntObjectHashMap<RpcOutcome<?>> map;

public RequestIdMap()

{ - map = new IntObjectHashMap<RpcOutcome<?>>(); + map = new IntObjectHashMap<>(); }

— End diff –

why do we need this change ?

ASF GitHub Bot added a comment - 20/Jun/17 23:07 Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/857#discussion_r123118364 — Diff: exec/rpc/src/main/java/org/apache/drill/exec/rpc/RequestIdMap.java — @@ -47,7 +47,7 @@ private final IntObjectHashMap<RpcOutcome<?>> map; public RequestIdMap() { - map = new IntObjectHashMap<RpcOutcome<?>>(); + map = new IntObjectHashMap<>(); } — End diff – why do we need this change ?

Khurram Faraaz added a comment - 20/Jun/17 23:58

arina verified fix, looks good on your private branch. None of the concurrent SQL were in CANCELLATION_REQUESTED state after a Ctrl-C was issued to the concurrent Java program from the command prompt, all queries were either in CANCELED or FAILED state.

Khurram Faraaz added a comment - 20/Jun/17 23:58 arina verified fix, looks good on your private branch. None of the concurrent SQL were in CANCELLATION_REQUESTED state after a Ctrl-C was issued to the concurrent Java program from the command prompt, all queries were either in CANCELED or FAILED state.

ASF GitHub Bot added a comment - 21/Jun/17 01:28

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/857#discussion_r123135163

— Diff: exec/rpc/src/main/java/org/apache/drill/exec/rpc/RequestIdMap.java —
@@ -47,7 +47,7 @@
private final IntObjectHashMap<RpcOutcome<?>> map;

public RequestIdMap()

{ - map = new IntObjectHashMap<RpcOutcome<?>>(); + map = new IntObjectHashMap<>(); }

— End diff –

This is code cleanup. Java can infer the type of the parameterized class without us telling it. Seems a similar cleanup was made a few lines below.

ASF GitHub Bot added a comment - 21/Jun/17 01:28

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/857#discussion_r123135561

— Diff: exec/rpc/src/main/java/org/apache/drill/exec/rpc/RequestIdMap.java —
@@ -111,13 +111,16 @@ public RpcListener(RpcOutcomeListener<T> handler, Class<T> clazz, int coordinati

@Override
public void operationComplete(ChannelFuture future) throws Exception {
-
if (!future.isSuccess()) {

removeFromMap(coordinationId);
if (future.channel().isActive()) { - throw new RpcException("Future failed"); - }
else {
setException(new ChannelClosedException());
+ try { + removeFromMap(coordinationId); + }
finally {
+ final Throwable cause = future.cause();
+ if (future.channel().isActive()) {
+ setException(cause == null ? new RpcException("Future has failed") : cause);
- - End diff –

I think the point is that the message is not sent to the end user, or written to the log. Rather, it is letting the listener know that the send failed. There is a reason for the failure. Ideally the `cause` is the real reason.

Still it would be good if we were to log such failures. So, maybe "Unknown channel failure" or some such might be helpful. Probably the developer will search the code to find where we raise that exception, so the message just has to be unique...

ASF GitHub Bot added a comment - 21/Jun/17 01:28 Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/857#discussion_r123135561 — Diff: exec/rpc/src/main/java/org/apache/drill/exec/rpc/RequestIdMap.java — @@ -111,13 +111,16 @@ public RpcListener(RpcOutcomeListener<T> handler, Class<T> clazz, int coordinati @Override public void operationComplete(ChannelFuture future) throws Exception { - if (!future.isSuccess()) { removeFromMap(coordinationId); if (future.channel().isActive()) { - throw new RpcException("Future failed"); - } else { setException(new ChannelClosedException()); + try { + removeFromMap(coordinationId); + } finally { + final Throwable cause = future.cause(); + if (future.channel().isActive()) { + setException(cause == null ? new RpcException("Future has failed") : cause); End diff – I think the point is that the message is not sent to the end user, or written to the log. Rather, it is letting the listener know that the send failed. There is a reason for the failure. Ideally the `cause` is the real reason. Still it would be good if we were to log such failures. So, maybe "Unknown channel failure" or some such might be helpful. Probably the developer will search the code to find where we raise that exception, so the message just has to be unique...

ASF GitHub Bot added a comment - 21/Jun/17 10:58

Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/857

Reworded error message. @paul-rogers and @ppadma thanks for code review!

ASF GitHub Bot added a comment - 21/Jun/17 10:58 Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/857 Reworded error message. @paul-rogers and @ppadma thanks for code review!

ASF GitHub Bot added a comment - 25/Jun/17 01:32

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/857

ASF GitHub Bot added a comment - 25/Jun/17 01:32 Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/857

Arina Ielchiieva added a comment - 26/Jun/17 09:06

Merged into master with commit id 7e6571aa5d4c58185dbfa131de99354ea7dc6b4e

Arina Ielchiieva added a comment - 26/Jun/17 09:06 Merged into master with commit id 7e6571aa5d4c58185dbfa131de99354ea7dc6b4e

People

Assignee:: Arina Ielchiieva

Reporter:: Arina Ielchiieva

Reviewer:: Padma Penumarthy

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 20/Jun/17 11:35

Updated:: 03/Aug/17 17:53

Resolved:: 26/Jun/17 09:05

Apache Drill

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates