[DRILL-5289] Drill should manage the heap memory so that we wouldn't hit an OOM due to insufficient heap - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.10.0
Fix Version/s: None
Component/s: Execution - Flow, Execution - RPC
Labels: None

Description

[Git Commit ID will be updated soon]

The query below, which uses the managed sort, causes an OOM error due to insufficient heap. That is a bug in itself.

ALTER SESSION SET `exec.sort.disable_managed` = false;
+-------+-------------------------------------+
|  ok   |               summary               |
+-------+-------------------------------------+
| true  | exec.sort.disable_managed updated.  |
+-------+-------------------------------------+
1 row selected (1.096 seconds)
0: jdbc:drill:zk=10.10.100.183:5181> alter session set `planner.memory.max_query_memory_per_node` = 14106127360;
+-------+----------------------------------------------------+
|  ok   |                      summary                       |
+-------+----------------------------------------------------+
| true  | planner.memory.max_query_memory_per_node updated.  |
+-------+----------------------------------------------------+
1 row selected (0.253 seconds)
0: jdbc:drill:zk=10.10.100.183:5181> alter session set `planner.width.max_per_node` = 1;
+-------+--------------------------------------+
|  ok   |               summary                |
+-------+--------------------------------------+
| true  | planner.width.max_per_node updated.  |
+-------+--------------------------------------+
1 row selected (0.184 seconds)
0: jdbc:drill:zk=10.10.100.183:5181> select * from (select * from dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0])d where d.columns[0] = 'ljdfhwuehnoiueyf';

Once the OOM happens, chaos follows:

1. Dangling fragments are left behind
2. The query fails, but ZooKeeper thinks it's still running.
3. Client connections time out.
4. The profile page shows the same query as both running and failed.

We should handle this situation more gracefully, as it could be perceived as a drillbit stability issue. I have attached the jstack output. The full logs and the data set used are too big to upload here; reach out to me if you need more information.

Attachments

jstack.txt (71 kB), 22/Feb/17 18:51, Rahul Kumar Challapalli
partial_log.txt (330 kB), 22/Feb/17 18:51, Rahul Kumar Challapalli
Screen Shot 2017-02-22 at 10.58.39 AM (2).png (401 kB), 22/Feb/17 19:00, Rahul Kumar Challapalli

People

Assignee: Unassigned

Reporter: Rahul Kumar Challapalli

Votes: 0

Watchers: 3

Dates

Created: 22/Feb/17 18:47

Updated: 14/Mar/17 22:07