Apache Drill / DRILL-5669

Multiple TPCH queries failed due to OOM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11.0
    • Component/s: Functions - Drill
    • Labels:
    • Environment: RHEL 6.4 (kernel 2.6.32-358.el6.x86_64), 10+1 node cluster

      Description

      While running the TPCH SF100 tests against Parquet (and CSV) data, multiple queries failed with out-of-memory (OOM) errors. For example, Q16 hit the following error:

      java.sql.SQLException: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.
      
      Unable to allocate sv2 for 65536 records, and not enough batchGroups to spill.
      batchGroups.size 1
      spilledBatchGroups.size 0
      allocated memory 23500416
      allocator limit 20000000
      Fragment 1:11
      
      [Error Id: e58161a6-2383-48b1-a350-50db1b5408c6 on ucs-node10.perf.lab:31010]
              at org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:489)
              at org.apache.drill.jdbc.impl.DrillCursor.next(DrillCursor.java:593)
              at org.apache.calcite.avatica.AvaticaResultSet.next(AvaticaResultSet.java:215)
              at org.apache.drill.jdbc.impl.DrillResultSetImpl.next(DrillResultSetImpl.java:140)
              at PipSQueak.fetchRows(PipSQueak.java:420)
              at PipSQueak.runTest(PipSQueak.java:116)
              at PipSQueak.main(PipSQueak.java:556)
      Caused by: org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.
      
      Unable to allocate sv2 for 65536 records, and not enough batchGroups to spill.
      batchGroups.size 1
      spilledBatchGroups.size 0
      allocated memory 23500416
      allocator limit 20000000
      Fragment 1:11
      

      The corresponding entry in drillbit.log:

      2017-07-12 11:34:11,670 ucs-node10.perf.lab [26999476-174e-98fd-e21e-fd53f79284c7:frag:1:11] INFO  o.a.d.e.p.i.xsort.ExternalSortBatch - User Error Occurred: One or more nodes ran out of memory while executing the query.
      org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.
      
      Unable to allocate sv2 for 65536 records, and not enough batchGroups to spill.
      batchGroups.size 1
      spilledBatchGroups.size 0
      allocated memory 23500416
      allocator limit 20000000
      
      [Error Id: e58161a6-2383-48b1-a350-50db1b5408c6 ]
              at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:550) ~[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.newSV2(ExternalSortBatch.java:639) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.innerNext(ExternalSortBatch.java:381) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext(StreamingAggBatch.java:140) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:144) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at java.security.AccessController.doPrivileged(Native Method) [na:1.7.0_65]
              at javax.security.auth.Subject.doAs(Subject.java:415) [na:1.7.0_65]
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595) [hadoop-common-2.7.0-mapr-1607.jar:na]
              at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
              at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
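
The diagnostic fields in the message above can be extracted and compared programmatically. The following is a minimal Java sketch, not Drill code: the class `OomDiagnostics` and its methods `parse` and `overrunBytes` are hypothetical names introduced for illustration, and the field names are copied from the error text in this report. It computes how far the allocated memory exceeded the allocator limit.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the memory fields out of the error text.
class OomDiagnostics {
    // Matches lines such as "allocated memory 23500416" or "batchGroups.size 1".
    private static final Pattern FIELD = Pattern.compile(
        "(?m)^\\s*(batchGroups\\.size|spilledBatchGroups\\.size|allocated memory|allocator limit)\\s+(\\d+)\\s*$");

    // Returns the numeric fields found in the message, keyed by field name.
    static Map<String, Long> parse(String message) {
        Map<String, Long> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(message);
        while (m.find()) {
            fields.put(m.group(1), Long.parseLong(m.group(2)));
        }
        return fields;
    }

    // Bytes by which the allocated memory exceeds the allocator limit.
    static long overrunBytes(Map<String, Long> fields) {
        return fields.get("allocated memory") - fields.get("allocator limit");
    }

    public static void main(String[] args) {
        // Message body copied from the error reported for Q16.
        String message = String.join("\n",
            "Unable to allocate sv2 for 65536 records, and not enough batchGroups to spill.",
            "batchGroups.size 1",
            "spilledBatchGroups.size 0",
            "allocated memory 23500416",
            "allocator limit 20000000",
            "Fragment 1:11");
        Map<String, Long> fields = parse(message);
        System.out.println("overrun bytes: " + overrunBytes(fields)); // prints "overrun bytes: 3500416"
    }
}
```

For the values in this report, the overrun is 23500416 - 20000000 = 3500416 bytes, with one in-memory batch group and nothing spilled yet.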
      
      

              People

              • Assignee: Boaz Ben-Zvi (ben-zvi)
              • Reporter: Dechang Gu (dechanggu)
              • Reviewer: Paul Rogers

                  Time Tracking

                  • Original Estimate: 72h
                  • Remaining Estimate: 72h
                  • Time Spent: Not Specified