Uploaded image for project: 'Tajo'
  1. Tajo
  2. TAJO-1469

allocateQueryMaster can leak resources if it times-out (3sec, hardcoded)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11.0
    • Component/s: None
    • Labels:
      None

      Description

          WorkerResourceAllocationResponse response = null;
          try {
            response = callFuture.get(3, TimeUnit.SECONDS);
          } catch (Throwable t) {
            LOG.error(t, t);
            return null;
          }
      

      If it times-out (or interrupted), allocated resources cannot be retrieved forever.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user navis opened a pull request:

          https://github.com/apache/tajo/pull/480

          TAJO-1469 allocateQueryMaster can leak resources if it times-out (3sec, hardcoded)

          Tested with some hack to reproduce timeout.
          ```java
          WorkerResourceAllocationResponse response = null;
          try

          { response = callFuture.get(3, queryInProgress.getQueryId().getSeq() < 3 ? TimeUnit.MICROSECONDS : TimeUnit.SECONDS); }

          catch (Throwable t)

          { .... }

          ```

          exceptions for first two but successes for all following queries.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/navis/tajo TAJO-1469

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tajo/pull/480.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #480


          commit 7700076e922a429b0ff5179052616c89afd5665d
          Author: navis.ryu <navis@apache.org>
          Date: 2015-03-30T07:06:08Z

          TAJO-1469 allocateQueryMaster can leak resources if it times-out (3sec, hardcoded)


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user navis opened a pull request: https://github.com/apache/tajo/pull/480 TAJO-1469 allocateQueryMaster can leak resources if it times-out (3sec, hardcoded) Tested with some hack to reproduce timeout. ```java WorkerResourceAllocationResponse response = null; try { response = callFuture.get(3, queryInProgress.getQueryId().getSeq() < 3 ? TimeUnit.MICROSECONDS : TimeUnit.SECONDS); } catch (Throwable t) { .... } ``` exceptions for first two but successes for all following queries. You can merge this pull request into a Git repository by running: $ git pull https://github.com/navis/tajo TAJO-1469 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tajo/pull/480.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #480 commit 7700076e922a429b0ff5179052616c89afd5665d Author: navis.ryu <navis@apache.org> Date: 2015-03-30T07:06:08Z TAJO-1469 allocateQueryMaster can leak resources if it times-out (3sec, hardcoded)
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user babokim commented on a diff in the pull request:

          https://github.com/apache/tajo/pull/480#discussion_r27793934

          — Diff: tajo-core/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java —
          @@ -221,11 +231,16 @@ public WorkerAllocatedResource allocateQueryMaster(QueryInProgress queryInProgre
          try

          { response = callFuture.get(3, TimeUnit.SECONDS); }

          catch (Throwable t) {

          • LOG.error(t, t);
          • return null;
            + response = callFuture.cancel(); // try cancel
            + if (response == null) {
            + // canceled future
            + LOG.warn("Got exception " + t);
            + LOG.error(t, t);
              • End diff –

          I think this line is not necessary if you change as the following in the immediately preceding line.
          ```
          LOG.warn("Got exception while allocating QueryMaster: " + t, t);
          ```

          Show
          githubbot ASF GitHub Bot added a comment - Github user babokim commented on a diff in the pull request: https://github.com/apache/tajo/pull/480#discussion_r27793934 — Diff: tajo-core/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java — @@ -221,11 +231,16 @@ public WorkerAllocatedResource allocateQueryMaster(QueryInProgress queryInProgre try { response = callFuture.get(3, TimeUnit.SECONDS); } catch (Throwable t) { LOG.error(t, t); return null; + response = callFuture.cancel(); // try cancel + if (response == null) { + // canceled future + LOG.warn("Got exception " + t); + LOG.error(t, t); End diff – I think this line is not necessary if you change as the following in the immediately preceding line. ``` LOG.warn("Got exception while allocating QueryMaster: " + t, t); ```
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user babokim commented on the pull request:

          https://github.com/apache/tajo/pull/480#issuecomment-90046382

          I investigated the lock of CallFuture. CallFuture should be synchronized with run() and get(). Current code looks like this would be implemented but not. If the following situation is occur, some resources or tasks will be lost forever.

          1. Worker: TaskRunner sends GetTask request.
          2. QM: QueryMaster selects proper task and calls RpcCallback.
          3. Worker: AsyncRpcClient receives the response and calls CallFuture.run(response).
          3-1. Worker: If TimeoutException occurs after 1) between 2) ~ 3), TaskRunner can't receive the response and doesn't run the allocated task, but QM doesn't know about that.

          If my thought is wrong, please let me know.
          If my thought is right, this patch is temporary solution and we need to create another issue for this problem.

          Show
          githubbot ASF GitHub Bot added a comment - Github user babokim commented on the pull request: https://github.com/apache/tajo/pull/480#issuecomment-90046382 I investigated the lock of CallFuture. CallFuture should be synchronized with run() and get(). Current code looks like this would be implemented but not. If the following situation is occur, some resources or tasks will be lost forever. 1. Worker: TaskRunner sends GetTask request. 2. QM: QueryMaster selects proper task and calls RpcCallback. 3. Worker: AsyncRpcClient receives the response and calls CallFuture.run(response). 3-1. Worker: If TimeoutException occurs after 1) between 2) ~ 3), TaskRunner can't receive the response and doesn't run the allocated task, but QM doesn't know about that. If my thought is wrong, please let me know. If my thought is right, this patch is temporary solution and we need to create another issue for this problem.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user babokim commented on a diff in the pull request:

          https://github.com/apache/tajo/pull/480#discussion_r27795231

          — Diff: tajo-rpc/tajo-rpc-protobuf/src/main/java/org/apache/tajo/rpc/SimpleExchange.java —
          @@ -0,0 +1,67 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.tajo.rpc;
          +
          +import com.google.protobuf.RpcCallback;
          +
          +import java.util.concurrent.Semaphore;
          +import java.util.concurrent.TimeUnit;
          +import java.util.concurrent.TimeoutException;
          +import java.util.concurrent.atomic.AtomicBoolean;
          +
          +public abstract class SimpleExchange<T> implements RpcCallback<T> {
          — End diff –

          I think that SimpleExchange name is not proper this class. How about "CancelableRpcCallback"?

          Show
          githubbot ASF GitHub Bot added a comment - Github user babokim commented on a diff in the pull request: https://github.com/apache/tajo/pull/480#discussion_r27795231 — Diff: tajo-rpc/tajo-rpc-protobuf/src/main/java/org/apache/tajo/rpc/SimpleExchange.java — @@ -0,0 +1,67 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.tajo.rpc; + +import com.google.protobuf.RpcCallback; + +import java.util.concurrent.Semaphore; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.TimeoutException; +import java.util.concurrent.atomic.AtomicBoolean; + +public abstract class SimpleExchange<T> implements RpcCallback<T> { — End diff – I think that SimpleExchange name is not proper this class. How about "CancelableRpcCallback"?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user babokim commented on the pull request:

          https://github.com/apache/tajo/pull/480#issuecomment-90051805

          If this patch is considered as a temporary solution, looks good to me.
          +1

          Show
          githubbot ASF GitHub Bot added a comment - Github user babokim commented on the pull request: https://github.com/apache/tajo/pull/480#issuecomment-90051805 If this patch is considered as a temporary solution, looks good to me. +1
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user navis commented on a diff in the pull request:

          https://github.com/apache/tajo/pull/480#discussion_r27837564

          — Diff: tajo-rpc/tajo-rpc-protobuf/src/main/java/org/apache/tajo/rpc/SimpleExchange.java —
          @@ -0,0 +1,67 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.tajo.rpc;
          +
          +import com.google.protobuf.RpcCallback;
          +
          +import java.util.concurrent.Semaphore;
          +import java.util.concurrent.TimeUnit;
          +import java.util.concurrent.TimeoutException;
          +import java.util.concurrent.atomic.AtomicBoolean;
          +
          +public abstract class SimpleExchange<T> implements RpcCallback<T> {
          — End diff –

          ok

          Show
          githubbot ASF GitHub Bot added a comment - Github user navis commented on a diff in the pull request: https://github.com/apache/tajo/pull/480#discussion_r27837564 — Diff: tajo-rpc/tajo-rpc-protobuf/src/main/java/org/apache/tajo/rpc/SimpleExchange.java — @@ -0,0 +1,67 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.tajo.rpc; + +import com.google.protobuf.RpcCallback; + +import java.util.concurrent.Semaphore; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.TimeoutException; +import java.util.concurrent.atomic.AtomicBoolean; + +public abstract class SimpleExchange<T> implements RpcCallback<T> { — End diff – ok
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user navis commented on a diff in the pull request:

          https://github.com/apache/tajo/pull/480#discussion_r27837678

          — Diff: tajo-core/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java —
          @@ -221,11 +231,16 @@ public WorkerAllocatedResource allocateQueryMaster(QueryInProgress queryInProgre
          try

          { response = callFuture.get(3, TimeUnit.SECONDS); }

          catch (Throwable t) {

          • LOG.error(t, t);
          • return null;
            + response = callFuture.cancel(); // try cancel
            + if (response == null) {
            + // canceled future
            + LOG.warn("Got exception " + t);
            + LOG.error(t, t);
              • End diff –

          This (response != null in catch block) means response is acquired after timeout. I don't think it should be logged even if we got expected result in time.

          Show
          githubbot ASF GitHub Bot added a comment - Github user navis commented on a diff in the pull request: https://github.com/apache/tajo/pull/480#discussion_r27837678 — Diff: tajo-core/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java — @@ -221,11 +231,16 @@ public WorkerAllocatedResource allocateQueryMaster(QueryInProgress queryInProgre try { response = callFuture.get(3, TimeUnit.SECONDS); } catch (Throwable t) { LOG.error(t, t); return null; + response = callFuture.cancel(); // try cancel + if (response == null) { + // canceled future + LOG.warn("Got exception " + t); + LOG.error(t, t); End diff – This (response != null in catch block) means response is acquired after timeout. I don't think it should be logged even if we got expected result in time.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user navis commented on the pull request:

          https://github.com/apache/tajo/pull/480#issuecomment-90269031

          Yes, if resource allocation is done over network, we need more serious work to fix that. If the problem is recognized fully by this simple, in-process fix, I have a intent to fix that, too.

          Show
          githubbot ASF GitHub Bot added a comment - Github user navis commented on the pull request: https://github.com/apache/tajo/pull/480#issuecomment-90269031 Yes, if resource allocation is done over network, we need more serious work to fix that. If the problem is recognized fully by this simple, in-process fix, I have a intent to fix that, too.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user jihoonson commented on the pull request:

          https://github.com/apache/tajo/pull/480#issuecomment-90296971

          Thanks @navis! This is a critical problem that should be fixed.
          You seem to handle a somple case in this issue. If so, would you book other remaining issues on Jira? It will be helpful to remind us.
          Thanks.

          Show
          githubbot ASF GitHub Bot added a comment - Github user jihoonson commented on the pull request: https://github.com/apache/tajo/pull/480#issuecomment-90296971 Thanks @navis! This is a critical problem that should be fixed. You seem to handle a somple case in this issue. If so, would you book other remaining issues on Jira? It will be helpful to remind us. Thanks.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user navis commented on the pull request:

          https://github.com/apache/tajo/pull/480#issuecomment-90359286

          Addressed comments & added "tajo.qm.resource.allocation.timeout" to set timeout.

          Show
          githubbot ASF GitHub Bot added a comment - Github user navis commented on the pull request: https://github.com/apache/tajo/pull/480#issuecomment-90359286 Addressed comments & added "tajo.qm.resource.allocation.timeout" to set timeout.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user navis commented on the pull request:

          https://github.com/apache/tajo/pull/480#issuecomment-90364307

          @jihoonson Would you mind to review TAJO-1385 first? it seemed it would overlap in some part.

          Show
          githubbot ASF GitHub Bot added a comment - Github user navis commented on the pull request: https://github.com/apache/tajo/pull/480#issuecomment-90364307 @jihoonson Would you mind to review TAJO-1385 first? it seemed it would overlap in some part.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user jihoonson commented on the pull request:

          https://github.com/apache/tajo/pull/480#issuecomment-90364612

          Ok. I'll review today.

          Show
          githubbot ASF GitHub Bot added a comment - Github user jihoonson commented on the pull request: https://github.com/apache/tajo/pull/480#issuecomment-90364612 Ok. I'll review today.
          Hide
          cepiross Jinhang Choi added a comment -

          Maybe TAJO-1146 is directly related to this issue, as a symptom of leakage.
          After this patch is released on master branch, I will test it to check TAJO-1146 resolution.

          Thank a lot for your contributions!

          Show
          cepiross Jinhang Choi added a comment - Maybe TAJO-1146 is directly related to this issue, as a symptom of leakage. After this patch is released on master branch, I will test it to check TAJO-1146 resolution. Thank a lot for your contributions!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user babokim commented on the pull request:

          https://github.com/apache/tajo/pull/480#issuecomment-90771564

          +1
          All tests passed. I'll commit soon.

          Show
          githubbot ASF GitHub Bot added a comment - Github user babokim commented on the pull request: https://github.com/apache/tajo/pull/480#issuecomment-90771564 +1 All tests passed. I'll commit soon.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tajo/pull/480

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/tajo/pull/480
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Tajo-master-CODEGEN-build #297 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/297/)
          TAJO-1469 allocateQueryMaster can leak resources if it times-out (3sec, hardcoded) (babokim: rev 69895422cf2f770d897a8cbe554ea739cd091b20)

          • tajo-rpc/tajo-rpc-protobuf/src/main/java/org/apache/tajo/rpc/CancelableRpcCallback.java
          • CHANGES
          • tajo-core/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java
          • tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Tajo-master-CODEGEN-build #297 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/297/ ) TAJO-1469 allocateQueryMaster can leak resources if it times-out (3sec, hardcoded) (babokim: rev 69895422cf2f770d897a8cbe554ea739cd091b20) tajo-rpc/tajo-rpc-protobuf/src/main/java/org/apache/tajo/rpc/CancelableRpcCallback.java CHANGES tajo-core/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Tajo-master-build #659 (See https://builds.apache.org/job/Tajo-master-build/659/)
          TAJO-1469 allocateQueryMaster can leak resources if it times-out (3sec, hardcoded) (babokim: rev 69895422cf2f770d897a8cbe554ea739cd091b20)

          • tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java
          • tajo-rpc/tajo-rpc-protobuf/src/main/java/org/apache/tajo/rpc/CancelableRpcCallback.java
          • tajo-core/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java
          • CHANGES
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Tajo-master-build #659 (See https://builds.apache.org/job/Tajo-master-build/659/ ) TAJO-1469 allocateQueryMaster can leak resources if it times-out (3sec, hardcoded) (babokim: rev 69895422cf2f770d897a8cbe554ea739cd091b20) tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java tajo-rpc/tajo-rpc-protobuf/src/main/java/org/apache/tajo/rpc/CancelableRpcCallback.java tajo-core/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java CHANGES

            People

            • Assignee:
              navis Navis
              Reporter:
              navis Navis
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development