Apache Drill / DRILL-5289

Drill should manage heap memory so that queries do not hit an OOM due to insufficient heap

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.0
    • Fix Version/s: None
    • Labels:
      None

      Description

      [Git Commit ID will be updated soon]

      The query below, which uses the managed sort, causes an OOM error due to insufficient heap, which is a bug in itself.

      ALTER SESSION SET `exec.sort.disable_managed` = false;
      +-------+-------------------------------------+
      |  ok   |               summary               |
      +-------+-------------------------------------+
      | true  | exec.sort.disable_managed updated.  |
      +-------+-------------------------------------+
      1 row selected (1.096 seconds)
      0: jdbc:drill:zk=10.10.100.183:5181> alter session set `planner.memory.max_query_memory_per_node` = 14106127360;
      +-------+----------------------------------------------------+
      |  ok   |                      summary                       |
      +-------+----------------------------------------------------+
      | true  | planner.memory.max_query_memory_per_node updated.  |
      +-------+----------------------------------------------------+
      1 row selected (0.253 seconds)
      0: jdbc:drill:zk=10.10.100.183:5181> alter session set `planner.width.max_per_node` = 1;
      +-------+--------------------------------------+
      |  ok   |               summary                |
      +-------+--------------------------------------+
      | true  | planner.width.max_per_node updated.  |
      +-------+--------------------------------------+
      1 row selected (0.184 seconds)
      0: jdbc:drill:zk=10.10.100.183:5181> select * from (select * from dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0])d where d.columns[0] = 'ljdfhwuehnoiueyf';
      

      Once the OOM happens, chaos follows:

      1. Dangling fragments are left behind.
      2. The query fails, but ZooKeeper thinks it's still running.
      3. Client connections time out.
      4. The profile page shows the same query as both running and failed.
      

      We should be handling this situation more gracefully as this could be perceived as a drillbit stability issue. I attached the jstack. The logs and data set used are too big to upload here. Reach out to me if you need more information.
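
      To illustrate what handling this more gracefully could look like, here is a minimal Java sketch of a heap headroom check that a heap-hungry operator might run before a large on-heap allocation. It is not Drill code: the HeapGuard class, its fits and canAllocate methods, and the 0.9 threshold are all hypothetical names and values chosen for illustration. Only the java.lang.management calls are standard JDK APIs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper (not part of Drill): refuses work when heap headroom is low. */
final class HeapGuard {
  // Fraction of the maximum heap that may be in use after the allocation (assumed policy).
  private final double maxUsedFraction;

  HeapGuard(double maxUsedFraction) {
    if (maxUsedFraction <= 0.0 || maxUsedFraction > 1.0) {
      throw new IllegalArgumentException("maxUsedFraction must be in (0, 1]");
    }
    this.maxUsedFraction = maxUsedFraction;
  }

  /** Pure policy check, kept separate so it can be tested without the real heap. */
  public static boolean fits(long usedBytes, long maxBytes, long requestedBytes,
                             double maxUsedFraction) {
    if (requestedBytes < 0) {
      throw new IllegalArgumentException("requestedBytes must be non-negative");
    }
    if (maxBytes <= 0) {
      return true; // the JVM reports -1 when the maximum is undefined; cannot judge, so allow
    }
    long limit = (long) (maxBytes * maxUsedFraction);
    // Written as a subtraction so that a huge request cannot overflow the comparison.
    return requestedBytes <= limit - usedBytes;
  }

  /** Checks the request against the current JVM heap usage. */
  public boolean canAllocate(long requestedBytes) {
    MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    return fits(heap.getUsed(), heap.getMax(), requestedBytes, maxUsedFraction);
  }

  public static void main(String[] args) {
    HeapGuard guard = new HeapGuard(0.9);
    long request = 64L * 1024 * 1024; // 64 MiB, an arbitrary example size
    if (guard.canAllocate(request)) {
      System.out.println("Enough heap headroom; proceed with the allocation.");
    } else {
      // A real system would fail the query with a clear error and release its resources here.
      System.out.println("Heap nearly exhausted; fail the query cleanly instead of hitting an OOM.");
    }
  }
}
```

      A check like this is only advisory, since heap usage can change between the check and the allocation. The point is that a query failing with a clear error is easier to clean up after than a JVM-level OutOfMemoryError.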

        Attachments

        1. Screen Shot 2017-02-22 at 10.58.39 AM (2).png
          401 kB
          Rahul Kumar Challapalli
        2. partial_log.txt
          330 kB
          Rahul Kumar Challapalli
        3. jstack.txt
          71 kB
          Rahul Kumar Challapalli

            People

            • Assignee: Unassigned
            • Reporter: Rahul Kumar Challapalli (rkins)