Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-2907

Drill performance degrades significantly over time - resource leak

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 0.9.0
    • 1.0.0
    • Functions - Drill
    • None

    Description

      #Generated by Git-Commit-Id-Plugin
      #Mon Apr 27 23:19:19 EDT 2015
      git.commit.id.abbrev=57a96d2
      git.commit.user.email=smp@apache.org
      git.commit.message.full=DRILL-2083\: Fix bug in merging receiver\n
      git.commit.id=57a96d200e12c0efcad3f3ca9d935c42647234b1

      Running Functional/Passing/complex suit in a loop, on a four node cluster, with cmd:

      for i in `seq 1 1000`; do echo $i; time ./run.sh -s Functional/Passing/complex/json -g smoke,functional,regression -t 400 -n 10;sleep 1;done

      Initially, one iteration took about 1 min, but after 300 iterations, it took about 38 min, considerably longer, and about 10% tests timeout.

      Drillbit memory consumption increased from less than 1G initially to more than 5G. Even after I stopped the test, drillbit CPU still spinning.

      Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
      Cpu(s): 11.5%us,  0.0%sy,  0.0%ni, 88.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Mem:  49416696k total, 28344300k used, 21072396k free,   218004k buffers
      Swap: 52428792k total,        0k used, 52428792k free,  1343096k cached
      
        PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
       9889 mapr      20   0 7360m 5.1g  37m S 275.7 10.9   2077:45 java
      

      I see the following exception in drill log:

      2015-04-29 09:29:54,679 [2abefae4-f2f5-6db8-c60d-6bfb30d2fcbf:frag:0:0] INFO  o.a.drill.exec.work.foreman.Foreman - foreman cleaning up.
      2015-04-29 09:29:54,713 [2abefa81-6079-5166-aeab-e9fd3d450af6:frag:0:0] INFO  o.a.d.e.w.fragment.FragmentExecutor - 2abefa81-6079-5166-aeab-e9fd3d450af6:0:0: State change requested from CANCELLATION_REQUESTED --> FAILED for
      2015-04-29 09:29:54,717 [2abefa81-6079-5166-aeab-e9fd3d450af6:frag:0:0] ERROR o.a.d.c.e.DrillRuntimeException - SYSTEM ERROR: You tried to do a batch data read operation when you were in a state of STOP.  You can only do this type of operation when you are in a state of OK or OK_NEW_SCHEMA.
      
      Fragment 0:0
      
      [468261de-df41-48fa-89f7-55a9cf23fc49 on qa-node120.qa.lab:31010]
      org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: You tried to do a batch data read operation when you were in a state of STOP.  You can only do this type of operation when you are in a state of OK or OK_NEW_SCHEMA.
      
      Fragment 0:0
      
      [468261de-df41-48fa-89f7-55a9cf23fc49 on qa-node120.qa.lab:31010]
              at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:465) ~[drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:248) [drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:218) [drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
              at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
      Caused by: java.lang.IllegalStateException: You tried to do a batch data read operation when you were in a state of STOP.  You can only do this type of operation when you are in a state of OK or OK_NEW_SCHEMA.
              at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.validateReadState(IteratorValidatorBatchIterator.java:54) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.getValueVectorId(IteratorValidatorBatchIterator.java:103) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.expr.ExpressionTreeMaterializer$MaterializeVisitor.visitSchemaPath(ExpressionTreeMaterializer.java:423) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.expr.ExpressionTreeMaterializer$MaterializeVisitor.visitSchemaPath(ExpressionTreeMaterializer.java:188) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.common.expression.SchemaPath.accept(SchemaPath.java:160) ~[drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.expr.ExpressionTreeMaterializer.materialize(ExpressionTreeMaterializer.java:104) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.expr.ExpressionTreeMaterializer.materialize(ExpressionTreeMaterializer.java:90) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.join.MergeJoinBatch.materializeExpression(MergeJoinBatch.java:547) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.join.MergeJoinBatch.generateNewWorker(MergeJoinBatch.java:372) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.join.MergeJoinBatch.innerNext(MergeJoinBatch.java:188) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:144) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:101) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:91) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:130) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:144) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:101) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:91) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:130) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:144) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:74) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:80) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:64) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:198) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:192) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at java.security.AccessController.doPrivileged(Native Method) ~[na:1.7.0_45]
              at javax.security.auth.Subject.doAs(Subject.java:415) ~[na:1.7.0_45]
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469) ~[hadoop-common-2.4.1-mapr-1408.jar:na]
              at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:192) [drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              ... 4 common frames omitted
      2015-04-29 09:29:54,721 [2abefa81-6079-5166-aeab-e9fd3d450af6:frag:0:0] INFO  o.a.drill.exec.work.foreman.Foreman - State change requested.  CANCELLATION_REQUESTED --> FAILED
      org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: You tried to do a batch data read operation when you were in a state of STOP.  You can only do this type of operation when you are in a state of OK or OK_NEW_SCHEMA.
      
      Fragment 0:0
      
      [468261de-df41-48fa-89f7-55a9cf23fc49 on qa-node120.qa.lab:31010]
              at org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:409) [drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.work.foreman.QueryManager$RootStatusReporter.statusChange(QueryManager.java:389) [drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.AbstractStatusReporter.fail(AbstractStatusReporter.java:90) [drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.AbstractStatusReporter.fail(AbstractStatusReporter.java:86) [drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:252) [drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:218) [drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
              at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
      2015-04-29 09:29:54,722 [2abefa81-6079-5166-aeab-e9fd3d450af6:frag:0:0] INFO  o.a.drill.exec.work.foreman.Foreman - foreman cleaning up.
      2015-04-29 09:29:54,738 [2abefa81-6079-5166-aeab-e9fd3d450af6:frag:0:0] INFO  o.a.drill.exec.work.foreman.Foreman - State change requested.  CANCELED --> COMPLETED
      2015-04-29 09:29:54,738 [2abefa81-6079-5166-aeab-e9fd3d450af6:frag:0:0] WARN  o.a.drill.exec.work.foreman.Foreman - Dropping request to move to COMPLETED state as query is already at CANCELED state (which is terminal).
      2015-04-29 09:29:54,760 [2abefa80-9f9b-00de-4655-c4620c86096a:frag:0:0] INFO  o.a.d.e.w.fragment.FragmentExecutor - 2abefa80-9f9b-00de-4655-c4620c86096a:0:0: State change requested from CANCELLATION_REQUESTED --> FINISHED for
      2015-04-29 09:29:54,760 [2abefa80-9f9b-00de-4655-c4620c86096a:frag:0:0] INFO  o.a.d.e.w.f.AbstractStatusReporter - State changed for 2abefa80-9f9b-00de-4655-c4620c86096a:0:0. New state: CANCELLED
      2015-04-29 09:29:54,760 [2abefa80-9f9b-00de-4655-c4620c86096a:frag:0:0] INFO  o.a.drill.exec.work.foreman.Foreman - State change requested.  CANCELLATION_REQUESTED --> COMPLETED
      2015-04-29 09:29:54,761 [2abefa80-9f9b-00de-4655-c4620c86096a:frag:0:0] INFO  o.a.drill.exec.work.foreman.Foreman - foreman cleaning up.
      

      Attachments

        1. drill.jstack
          52 kB
          Chun Chang
        2. drill.native_memory
          2 kB
          Chun Chang

        Activity

          jnadeau Jacques Nadeau added a comment -

          Resolved through various stability fixes over last week.

          jnadeau Jacques Nadeau added a comment - Resolved through various stability fixes over last week.
          cchang@maprtech.com Chun Chang added a comment -

          Ran same test on 1.0 GA build for over 72 hours. No memory/resource issues noticed.

          cchang@maprtech.com Chun Chang added a comment - Ran same test on 1.0 GA build for over 72 hours. No memory/resource issues noticed.

          People

            cchang@maprtech.com Chun Chang
            cchang@maprtech.com Chun Chang
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: