Tajo / TAJO-587

Query is hanging when OutOfMemoryError occurs in the query master

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: TajoMaster
    • Labels: None

      Description

      See the title. When I run a simple sort query against a 1TB table, the query hangs and never finishes.
      Queries should be terminated immediately when OOME occurs.

      tajo> select l_orderkey from lineitem order by l_orderkey
      
      2014-02-05 17:20:52,339 FATAL master.TajoAsyncDispatcher (TajoAsyncDispatcher.java:dispatch(143)) - Error in dispatcher thread:SUBQUERY_COMPLETED
      java.lang.OutOfMemoryError: GC overhead limit exceeded
              at java.net.URI.create(URI.java:857)
              at org.apache.tajo.master.querymaster.Repartitioner.scheduleRangeShuffledFetches(Repartitioner.java:342)
              at org.apache.tajo.master.querymaster.Repartitioner.scheduleFragmentsForNonLeafTasks(Repartitioner.java:261)
              at org.apache.tajo.master.querymaster.SubQuery$InitAndRequestContainer.schedule(SubQuery.java:680)
              at org.apache.tajo.master.querymaster.SubQuery$InitAndRequestContainer.transition(SubQuery.java:523)
              at org.apache.tajo.master.querymaster.SubQuery$InitAndRequestContainer.transition(SubQuery.java:504)
              at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
              at org.apache.tajo.master.querymaster.SubQuery.handle(SubQuery.java:481)
              at org.apache.tajo.master.querymaster.Query$SubQueryCompletedTransition.executeNextBlock(Query.java:311)
              at org.apache.tajo.master.querymaster.Query$SubQueryCompletedTransition.transition(Query.java:357)
              at org.apache.tajo.master.querymaster.Query$SubQueryCompletedTransition.transition(Query.java:297)
              at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
              at org.apache.tajo.master.querymaster.Query.handle(Query.java:584)
              at org.apache.tajo.master.querymaster.Query.handle(Query.java:58)
              at org.apache.tajo.master.TajoAsyncDispatcher.dispatch(TajoAsyncDispatcher.java:137)
              at org.apache.tajo.master.TajoAsyncDispatcher$1.run(TajoAsyncDispatcher.java:79)
              at java.lang.Thread.run(Thread.java:701)
      2014-02-05 17:20:52,339 WARN  querymaster.QueryMaster (QueryMaster.java:run(459)) - Query q_1391587770871_0001 stopped cause query sesstion timeout: 384113 ms
      2014-02-05 17:20:52,339 INFO  querymaster.QueryMasterTask (QueryMasterTask.java:stop(168)) - Stopping QueryMasterTask:q_1391587770871_0001
      2014-02-05 17:20:52,346 INFO  master.TajoAsyncDispatcher (TajoAsyncDispatcher.java:stop(122)) - AsyncDispatcher stopped:q_1391587770871_0001
      2014-02-05 17:20:52,351 INFO  querymaster.QueryMasterTask (QueryMasterTask.java:stop(198)) - Stopped QueryMasterTask:q_1391587770871_0001
      2014-02-05 17:23:28,614 ERROR worker.TajoWorker (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM
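      The log above shows the dispatcher logging the OutOfMemoryError as FATAL and then carrying on, which is what leaves the query hanging. A minimal sketch of the fail-fast behavior the description proposes (illustrative only, not Tajo's actual TajoAsyncDispatcher; all names here are hypothetical):

      ```java
      import java.util.concurrent.BlockingQueue;
      import java.util.concurrent.LinkedBlockingQueue;
      import java.util.concurrent.atomic.AtomicBoolean;

      // Sketch: a dispatcher loop that marks the query failed as soon as an
      // event handler dies with an Error, instead of only logging it.
      public class FailFastDispatcher {
          private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
          private final AtomicBoolean failed = new AtomicBoolean(false);

          public void post(Runnable event) { events.add(event); }

          public boolean hasFailed() { return failed.get(); }

          // Drains pending events; an Error (e.g. OutOfMemoryError) fails the
          // query immediately and stops dispatching further events.
          public void drain() {
              Runnable event;
              while ((event = events.poll()) != null) {
                  try {
                      event.run();
                  } catch (Error e) {
                      failed.set(true);   // fail fast rather than hang
                      return;
                  }
              }
          }

          public static void main(String[] args) {
              FailFastDispatcher d = new FailFastDispatcher();
              d.post(() -> { throw new OutOfMemoryError("simulated"); });
              d.post(() -> System.out.println("should not run"));
              d.drain();
              System.out.println("failed=" + d.hasFailed());
          }
      }
      ```

      The real fix is subtler, since after a genuine OOME even the failure path may not have memory to run, which is why the discussion below leans toward shutting down rather than recovering.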
      

        Activity

        hyunsik Hyunsik Choi added a comment -

        There is much room for improvement in the method scheduleRangeShuffledFetches(). First of all, we should use just the hostname and several integers indicating the subquery id, task id, and attempt id, instead of a URI. It will significantly reduce the main memory usage.

        As a temporary solution, you can also give more memory via TAJO_WORKER_HEAPSIZE. It may help depending on your environment.
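        The idea in the comment above could look roughly like the following sketch. The class and field names are hypothetical, not Tajo's actual code; the point is that an interned hostname plus three small integers is far cheaper per fetch than a full java.net.URI object:

        ```java
        // Hypothetical compact fetch descriptor: interned hostname + integer ids
        // instead of one URI instance per fetch.
        public class CompactFetch {
            final String host;      // interned, so duplicates share one instance
            final int subQueryId;
            final int taskId;
            final int attemptId;

            CompactFetch(String host, int subQueryId, int taskId, int attemptId) {
                this.host = host.intern();
                this.subQueryId = subQueryId;
                this.taskId = taskId;
                this.attemptId = attemptId;
            }

            // The URI string can still be materialized lazily, only when the
            // fetch is actually issued (URL layout here is illustrative).
            String toUriString() {
                return "http://" + host + "/?sid=" + subQueryId
                        + "&ta=" + taskId + "_" + attemptId;
            }

            public static void main(String[] args) {
                CompactFetch f = new CompactFetch("worker01", 3, 42, 0);
                System.out.println(f.toUriString());
            }
        }
        ```

        With many thousands of fetches per range-shuffled stage, avoiding eager URI construction is where the memory saving would come from.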

        jihoonson Jihoon Son added a comment - edited

        This is not related to memory issues.
        Queries should be terminated immediately when OOME occurs in the query master, but they are hanging.

        hyunsik Hyunsik Choi added a comment -

        Your point is that a failed query should be stopped immediately instead of hanging. Is that right? I didn't get your point at first because this issue described a situation without a proposal.

        Nevertheless, I still think that the OOM caused by this situation is a very useful report for us, and we have to resolve that problem too. If OOM still occurs, we still cannot execute this kind of query, even if we fix the hanging problem. Of course, we also have to fix the hanging caused by OOM.

        Thank you for this report.

        hyunsik Hyunsik Choi added a comment -

        I've briefly investigated OOME in Java. The following links may be useful for this issue.

        http://stackoverflow.com/questions/3058198/can-the-jvm-recover-from-an-outofmemoryerror-without-a-restart
        http://stackoverflow.com/questions/2679330/catching-java-lang-outofmemoryerror

        According to the above links, recovering from OOME looks hard. A graceful shutdown looks like the best way.

        If there are alternative ways, please share them here.
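        One common shape for the shutdown approach (a sketch under the assumption that recovery is off the table, not a proposed patch) is a default uncaught-exception handler that halts the JVM when any thread dies with an OutOfMemoryError:

        ```java
        // Sketch: halt the JVM on OOME so the query master exits instead of
        // hanging with a poisoned dispatcher thread.
        public class OomeShutdown {
            // True when the throwable, or any of its causes, is an OOME.
            static boolean shouldHalt(Throwable t) {
                for (Throwable c = t; c != null; c = c.getCause()) {
                    if (c instanceof OutOfMemoryError) return true;
                }
                return false;
            }

            public static void install() {
                Thread.setDefaultUncaughtExceptionHandler((thread, t) -> {
                    if (shouldHalt(t)) {
                        // Runtime.halt skips shutdown hooks and finalizers,
                        // which may themselves fail under memory pressure.
                        Runtime.getRuntime().halt(1);
                    }
                });
            }
        }
        ```

        Depending on the JVM version, the flags -XX:OnOutOfMemoryError="kill -9 %p" or (on later JVMs) -XX:+ExitOnOutOfMemoryError achieve the same effect without code changes.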

        jihoonson Jihoon Son added a comment -

        Thanks, Hyunsik.
        I added a proposal to the issue description.
        Also, I'll investigate your links and other approaches.

        hyunsik Hyunsik Choi added a comment -

        This kind of problem has since been fixed, and the stability of QueryMaster has been improved significantly, so I haven't met such a problem for a long time. I'm resolving this issue as Not A Problem. If we meet this kind of problem again, we can create another JIRA.


          People

          • Assignee: Unassigned
          • Reporter: jihoonson Jihoon Son
          • Votes: 0
          • Watchers: 1

          Dates

          • Created:
          • Updated:
          • Resolved:
