Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-3648

NTILE function returns incorrect results

    XMLWordPrintableJSON

Details

    Description

      NTILE function returns incorrect results for larger dataset. I am working on reproducing the problem with a smaller dataset.

      The inner query that uses NTILE should have divided the rows into two sets (tiles) where each tile consists of (937088 + 1 ) rows , 937088 rows

      0: jdbc:drill:schema=dfs.tmp> select ntile_key2, count(ntile_key2) from (select ntile(2) over(partition by key2 order by key1) ntile_key2 from `twoKeyJsn.json` where key2 = 'm') group by ntile_key2;
      +-------------+----------+
      | ntile_key2  |  EXPR$1  |
      +-------------+----------+
      | 1           | 1        |
      | 2           | 1874176  |
      +-------------+----------+
      2 rows selected (49.406 seconds)
      

      Explain plan for inner query that returns wrong results.

      0: jdbc:drill:schema=dfs.tmp> explain plan for select ntile(2) over(partition by key2 order by key1) from `twoKeyJsn.json` where key2 = 'm';
      +------+------+
      | text | json |
      +------+------+
      | 00-00    Screen
      00-01      UnionExchange
      01-01        Project(EXPR$0=[$0])
      01-02          Project($0=[$2])
      01-03            Window(window#0=[window(partition {0} order by [1] range between UNBOUNDED PRECEDING and CURRENT ROW aggs [NTILE($2)])])
      01-04              SelectionVectorRemover
      01-05                Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
      01-06                  Project(key2=[$0], key1=[$1])
      01-07                    HashToRandomExchange(dist0=[[$0]])
      02-01                      UnorderedMuxExchange
      03-01                        Project(key2=[$0], key1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
      03-02                          SelectionVectorRemover
      03-03                            Filter(condition=[=($0, 'm')])
      03-04                              Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/tmp/twoKeyJsn.json, numFiles=1, columns=[`key2`, `key1`], files=[maprfs:///tmp/twoKeyJsn.json]]])
      

      Total number of rows in partition that has key2 = 'm'

      0: jdbc:drill:schema=dfs.tmp> select count(key1) from `twoKeyJsn.json` where key2 = 'm';
      +----------+
      |  EXPR$0  |
      +----------+
      | 1874177  |
      +----------+
      1 row selected (37.581 seconds)
      

      Attachments

        Issue Links

          Activity

            People

              adeneche Abdel Hakim Deneche
              khfaraaz Khurram Faraaz
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: