Details
- Type: Bug
- Priority: Minor
- Status: Closed
- Resolution: Fixed
- Fix Version: 1.0.0
- Fixed in commit: 1de6aed93efce8a524964371d96673b8ef192d89
Description
A query that uses window functions takes too long to complete and return results. It returns close to a million records (1,000,007 rows), for which it took 533.8 seconds (~9 minutes).
The input CSV file has two columns, one integer and the other varchar. Please take a look.
Size of the input CSV file:
[root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
-rwxr-xr-x 3 root root 27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv
select count(*) over(partition by cast(columns[1] as varchar(25)) order by cast(columns[0] as bigint)) from `manyDuplicates.csv`;
...
1,000,007 rows selected (533.857 seconds)
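For reference, the running count that this window definition produces can be modeled in plain Python — a minimal sketch, not Drill's implementation. Under the default RANGE frame (UNBOUNDED PRECEDING to CURRENT ROW), rows that share the same ORDER BY key are peers, so they all receive the same count:

```python
from collections import defaultdict

def running_count(rows):
    """Model COUNT(*) OVER (PARTITION BY part_key ORDER BY order_key).

    rows: list of (order_key, part_key) tuples.
    Returns (part_key, order_key, count) per row, sorted within each partition.
    """
    parts = defaultdict(list)
    for order_key, part_key in rows:
        parts[part_key].append(order_key)
    results = []
    for part_key, keys in parts.items():
        keys.sort()
        for k in keys:
            # RANGE frame: every row whose order key is <= the current row's
            # key (including peers with an equal key) is inside the frame.
            count = sum(1 for other in keys if other <= k)
            results.append((part_key, k, count))
    return results
```

Note that peer rows (equal order keys) all report the frame-wide count, which is why the result is not a strictly increasing 1, 2, 3, … sequence when duplicates exist.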
There are five distinct values in columns[1] in the CSV file, i.e. five partitions.
0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from `manyDuplicates.csv`;
+-----------------------+
|        EXPR$0         |
+-----------------------+
| FFFFGGGGHHHHIIIIJJJJ  |
| PPPPQQQQRRRRSSSSTTTT  |
| AAAABBBBCCCCDDDDEEEE  |
| UUUUVVVVWWWWXXXXZZZZ  |
| KKKKLLLLMMMMNNNNOOOO  |
+-----------------------+
5 rows selected (1.906 seconds)
Here are the counts for each of those values in columns[1]:
0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
+---------+
| EXPR$0 |
+---------+
| 200484 |
+---------+
1 row selected (0.961 seconds)
0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
+---------+
| EXPR$0 |
+---------+
| 199353 |
+---------+
1 row selected (0.86 seconds)
0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
+---------+
| EXPR$0 |
+---------+
| 200702 |
+---------+
1 row selected (0.826 seconds)
0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
+---------+
| EXPR$0 |
+---------+
| 199916 |
+---------+
1 row selected (0.851 seconds)
0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
+---------+
| EXPR$0 |
+---------+
| 199552 |
+---------+
1 row selected (0.827 seconds)
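As a quick sanity check, the five per-value counts reported above add up exactly to the 1,000,007 rows returned by the window query:

```python
# Per-partition counts taken from the query results above.
counts = {
    'FFFFGGGGHHHHIIIIJJJJ': 200484,
    'PPPPQQQQRRRRSSSSTTTT': 199353,
    'AAAABBBBCCCCDDDDEEEE': 200702,
    'UUUUVVVVWWWWXXXXZZZZ': 199916,
    'KKKKLLLLMMMMNNNNOOOO': 199552,
}
total = sum(counts.values())
print(total)  # 1000007 -- matches the row count of the window query
```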
Query plan for the long-running query:
00-00    Screen
00-01      UnionExchange
01-01        Project(EXPR$0=[$0])
01-02          Project($0=[$2])
01-03            Window(window#0=[window(partition {1} order by [0] range between UNBOUNDED PRECEDING and CURRENT ROW aggs [COUNT()])])
01-04              SelectionVectorRemover
01-05                Sort(sort0=[$1], sort1=[$0], dir0=[ASC], dir1=[ASC])
01-06                  Project($0=[$0], $1=[$1])
01-07                    HashToRandomExchange(dist0=[[$1]])
02-01                      UnorderedMuxExchange
03-01                        Project($0=[$0], $1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1))])
03-02                          Project($0=[CAST(ITEM($0, 0)):BIGINT], $1=[CAST(ITEM($0, 1)):VARCHAR(25) CHARACTER SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"])
03-03                            Scan(groupscan=[EasyGroupScan [selectionRoot=/tmp/manyDuplicates.csv, numFiles=1, columns=[`columns`[0], `columns`[1]], files=[maprfs:///tmp/manyDuplicates.csv]]])
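One likely contributor to the long runtime: the HashToRandomExchange distributes rows by the hash of columns[1], and with only five distinct partition keys, at most five receiving minor fragments can get any data, concentrating the downstream Sort and Window work regardless of cluster size. A small sketch illustrates the effect (Python's built-in hash stands in for Drill's hash64-based distribution, which is an assumption for illustration only):

```python
# Hypothetical stand-in for hash-based row distribution: each distinct key
# maps to exactly one bucket, so five keys can occupy at most five of the
# available minor fragments no matter how many fragments exist.
keys = ['AAAABBBBCCCCDDDDEEEE', 'FFFFGGGGHHHHIIIIJJJJ', 'KKKKLLLLMMMMNNNNOOOO',
        'PPPPQQQQRRRRSSSSTTTT', 'UUUUVVVVWWWWXXXXZZZZ']
num_fragments = 16
occupied = {hash(k) % num_fragments for k in keys}
print(len(occupied))  # at most 5 of the 16 fragments receive any rows
```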
Python script to generate the data in CSV format:

import random

values = ['AAAABBBBCCCCDDDDEEEE', 'FFFFGGGGHHHHIIIIJJJJ', 'KKKKLLLLMMMMNNNNOOOO',
          'PPPPQQQQRRRRSSSSTTTT', 'UUUUVVVVWWWWXXXXZZZZ']
f = open('/Users/kungfo/manyDuplicates.csv', 'a')
for i in range(1000000):
    # Random integer key plus one of the five varchar values per row.
    f.write(str(random.randint(1, 999999)) + ',' + random.choice(values) + '\n')
f.close()