Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-3269

Window function query takes too long to complete and return results

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1.0
    • Component/s: Execution - Flow
    • Labels:
    • Environment:

      1de6aed93efce8a524964371d96673b8ef192d89

      Description

      Query that uses window functions takes too long to complete and return results. It returns close to a million records, for which it took 533.8 seconds ~8 minutes
      Input CSV file has two columns, one integer and another varchar type column. Please take a look.

      Size of the input CSV file
      root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
      -rwxr-xr-x 3 root root 27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv

      select count(*) over(partition by cast(columns[1] as varchar(25)) order by cast(columns[0] as bigint)) from `manyDuplicates.csv`;
      ...
      1,000,007 rows selected (533.857 seconds)
      

      There are five distinct values in columns[1] in the CSV file. = [FIVE PARTITIONS]

      0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from `manyDuplicates.csv`;
      +-----------------------+
      |        EXPR$0         |
      +-----------------------+
      | FFFFGGGGHHHHIIIIJJJJ  |
      | PPPPQQQQRRRRSSSSTTTT  |
      | AAAABBBBCCCCDDDDEEEE  |
      | UUUUVVVVWWWWXXXXZZZZ  |
      | KKKKLLLLMMMMNNNNOOOO  |
      +-----------------------+
      5 rows selected (1.906 seconds)
      

      Here is the count for each of those values in columns[1]

      0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
      +---------+
      | EXPR$0  |
      +---------+
      | 200484  |
      +---------+
      1 row selected (0.961 seconds)
      
      0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
      +---------+
      | EXPR$0  |
      +---------+
      | 199353  |
      +---------+
      1 row selected (0.86 seconds)
      
      0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
      +---------+
      | EXPR$0  |
      +---------+
      | 200702  |
      +---------+
      1 row selected (0.826 seconds)
      
      0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
      +---------+
      | EXPR$0  |
      +---------+
      | 199916  |
      +---------+
      1 row selected (0.851 seconds)
      
      0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
      +---------+
      | EXPR$0  |
      +---------+
      | 199552  |
      +---------+
      1 row selected (0.827 seconds)
      

      Query plan for the long running query

      | 00-00    Screen
      00-01      UnionExchange
      01-01        Project(EXPR$0=[$0])
      01-02          Project($0=[$2])
      01-03            Window(window#0=[window(partition {1} order by [0] range between UNBOUNDED PRECEDING and CURRENT ROW aggs [COUNT()])])
      01-04              SelectionVectorRemover
      01-05                Sort(sort0=[$1], sort1=[$0], dir0=[ASC], dir1=[ASC])
      01-06                  Project($0=[$0], $1=[$1])
      01-07                    HashToRandomExchange(dist0=[[$1]])
      02-01                      UnorderedMuxExchange
      03-01                        Project($0=[$0], $1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1))])
      03-02                          Project($0=[CAST(ITEM($0, 0)):BIGINT], $1=[CAST(ITEM($0, 1)):VARCHAR(25) CHARACTER SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"])
      03-03                            Scan(groupscan=[EasyGroupScan [selectionRoot=/tmp/manyDuplicates.csv, numFiles=1, columns=[`columns`[0], `columns`[1]], files=[maprfs:///tmp/manyDuplicates.csv]]])
      

      python script to generate data in CSV format

      import random
      f = open('/Users/kungfo/manyDuplicates.csv', 'a')
      for i in range(1,000000):
          f.write(str(random.choice(xrange(1,1000000)))+','+str(random.choice(['AAAABBBBCCCCDDDDEEEE','FFFFGGGGHHHHIIIIJJJJ','KKKKLLLLMMMMNNNNOOOO','PPPPQQQQRRRRSSSSTTTT','UUUUVVVVWWWWXXXXZZZZ']))+'\n')
          f.flush()
          
          
      

        Attachments

          Activity

            People

            • Assignee:
              adeneche Deneche A. Hakim
              Reporter:
              khfaraaz Khurram Faraaz
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: