Apache Drill / DRILL-2309

Selecting count(), avg() of nullable columns causes wrong results


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.9.0
    • Component/s: Execution - Data Types
    • Labels:
      None

      Description

      #Thu Feb 19 18:40:10 EST 2015
      git.commit.id.abbrev=1ceddff

      The following query returns the correct count for a column that contains null values.

      0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, count(tt.nul) from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt group by tt.gbyi order by tt.gbyi;
      +------------+------------+
      |    gbyi    |   EXPR$1   |
      +------------+------------+
      | 0          | 33580      |
      | 1          | 33317      |
      | 2          | 33438      |
      | 3          | 33535      |
      | 4          | 33369      |
      | 5          | 32990      |
      | 6          | 33661      |
      | 7          | 33130      |
      | 8          | 33362      |
      | 9          | 33364      |
      | 10         | 33229      |
      | 11         | 33567      |
      | 12         | 33379      |
      | 13         | 33045      |
      | 14         | 33305      |
      +------------+------------+
      

      But if you add more aggregations to the query, the returned count is wrong (note the last column).

      0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, sum(tt.id), avg(tt.fl), count(tt.nul) from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt group by tt.gbyi order by tt.gbyi;
      +------------+------------+------------+------------+
      |    gbyi    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
      +------------+------------+------------+------------+
      | 0          | 33445554017 | 499613.0956877819 | 66943      |
      | 1          | 33209358334 | 500760.0252919893 | 66318      |
      | 2          | 33369118041 | 498091.82200273 | 66994      |
      | 3          | 33254533860 | 498696.5063226428 | 66683      |
      | 4          | 33393965595 | 501125.64656145993 | 66638      |
      | 5          | 33216885506 | 499961.32710397616 | 66439      |
      | 6          | 33380205950 | 498875.3923256599 | 66911      |
      | 7          | 33405849390 | 501093.43067788356 | 66666      |
      | 8          | 33136951190 | 498458.1044031481 | 66479      |
      | 9          | 33319291474 | 499967.5392457864 | 66643      |
      | 10         | 33339388887 | 499190.47462408233 | 66787      |
      | 11         | 33571590550 | 502095.86682194035 | 66863      |
      | 12         | 33437342090 | 501708.8141502653 | 66647      |
      | 13         | 33071800925 | 498896.453904129 | 66290      |
      | 14         | 33448664191 | 501487.4206955959 | 66699      |
      +------------+------------+------------+------------+
      
      Plan for the query that returned the wrong results:
      
      

      0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select tt.gbyi, sum(tt.id), avg(tt.fl), count(tt.nul) from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt group by tt.gbyi order by tt.gbyi;
      +----------------------+
      |     text | json      |
      +----------------------+
      00-00 Screen
      00-01 Project(gbyi=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
      00-02 SingleMergeExchange(sort0=[0 ASC])
      01-01 SelectionVectorRemover
      01-02 Sort(sort0=[$0], dir0=[ASC])
      01-03 Project(gbyi=[$0], EXPR$1=[CASE(=($2, 0), null, $1)], EXPR$2=[CAST(/(CastHigh(CASE(=($4, 0), null, $3)), $4)):ANY], EXPR$3=[$5])
      01-04 HashAgg(group=[{0}], agg#0=[$SUM0($1)], agg#1=[$SUM0($2)], agg#2=[$SUM0($3)], agg#3=[$SUM0($4)], EXPR$3=[$SUM0($5)])
      01-05 HashToRandomExchange(dist0=[[$0]])
      02-01 HashAgg(group=[{0}]
      , agg#0=[$SUM0($1)], agg#1=[COUNT($1)], agg#2=[$SUM0($2)], agg#3=[COUNT($2)], EXPR$3=[COUNT()])
      02-02 Project(gbyi=[$3], id=[$2], fl=[$1], nul=[$0])
      02-03 Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, columns=[`gbyi`, `id`, `fl`, `nul`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
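      In the plan above, the minor-fragment aggregate (02-01) computes EXPR$3 as COUNT() with no argument, i.e. a plain row count, rather than COUNT($0) over the nul column. That would explain the inflated values: COUNT(*) counts every row in the group, while COUNT(col) must skip nulls. A minimal sketch of the two semantics (illustrative Python, not Drill code; the sample rows are made up):

      ```python
      # COUNT(*) vs COUNT(col) semantics over nullable data.
      # Hypothetical sample: one group with a nullable column "nul".
      rows = [
          {"gbyi": 0, "nul": None},
          {"gbyi": 0, "nul": 7},
          {"gbyi": 0, "nul": None},
          {"gbyi": 0, "nul": 9},
      ]

      # COUNT(*): counts every row, nulls included.
      count_star = len(rows)

      # COUNT(nul): counts only rows where the column is non-null.
      count_col = sum(1 for r in rows if r["nul"] is not None)

      print(count_star, count_col)  # 4 2
      ```

      If the aggregate rewrite silently turns COUNT(tt.nul) into the first form when other aggregates (SUM, AVG) are present, the result matches the symptom reported here.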
      
      

        Attachments

        1. DRILL-2309.patch
          4 kB
          Mehant Baid

          Activity

            People

            • Assignee:
              cchang@maprtech.com Chun Chang
              Reporter:
              cchang@maprtech.com Chun Chang
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: