[DRILL-3648] NTILE function returns incorrect results - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.2.0
Fix Version/s: 1.2.0
Component/s: Execution - Flow
Labels:
- window_function
Environment:

private-branch https://github.com/adeneche/incubator-drill/tree/new-window-funcs

Description

NTILE function returns incorrect results for larger dataset. I am working on reproducing the problem with a smaller dataset.

The inner query that uses NTILE should have divided the rows into two sets (tiles) where each tile consists of (937088 + 1 ) rows , 937088 rows

0: jdbc:drill:schema=dfs.tmp> select ntile_key2, count(ntile_key2) from (select ntile(2) over(partition by key2 order by key1) ntile_key2 from `twoKeyJsn.json` where key2 = 'm') group by ntile_key2;
+-------------+----------+
| ntile_key2  |  EXPR$1  |
+-------------+----------+
| 1           | 1        |
| 2           | 1874176  |
+-------------+----------+
2 rows selected (49.406 seconds)

Explain plan for inner query that returns wrong results.

0: jdbc:drill:schema=dfs.tmp> explain plan for select ntile(2) over(partition by key2 order by key1) from `twoKeyJsn.json` where key2 = 'm';
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      UnionExchange
01-01        Project(EXPR$0=[$0])
01-02          Project($0=[$2])
01-03            Window(window#0=[window(partition {0} order by [1] range between UNBOUNDED PRECEDING and CURRENT ROW aggs [NTILE($2)])])
01-04              SelectionVectorRemover
01-05                Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
01-06                  Project(key2=[$0], key1=[$1])
01-07                    HashToRandomExchange(dist0=[[$0]])
02-01                      UnorderedMuxExchange
03-01                        Project(key2=[$0], key1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
03-02                          SelectionVectorRemover
03-03                            Filter(condition=[=($0, 'm')])
03-04                              Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/tmp/twoKeyJsn.json, numFiles=1, columns=[`key2`, `key1`], files=[maprfs:///tmp/twoKeyJsn.json]]])

Total number of rows in partition that has key2 = 'm'

0: jdbc:drill:schema=dfs.tmp> select count(key1) from `twoKeyJsn.json` where key2 = 'm';
+----------+
|  EXPR$0  |
+----------+
| 1874177  |
+----------+
1 row selected (37.581 seconds)

Attachments

Issue Links

Is contained by

DRILL-3536 Add support for LEAD, LAG, NTILE, FIRST_VALUE and LAST_VALUE window functions

Closed

Activity

People

Assignee:: Abdel Hakim Deneche

Reporter:: Khurram Faraaz

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 14/Aug/15 17:12

Updated:: 01/Sep/15 22:25

Resolved:: 18/Aug/15 17:19