Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-1977

Wrong result with aggregation on top of UNION ALL operator

    XMLWordPrintableJSON

Details

    Description

      Jan 10 01:17:13 root@~ ] # hadoop fs -cat /test/t.json
      { "a1": 0, "b1": 0, "c1": "true",   "d1" : "2015-01-02"}
      { "a1": 0, "b1": 0, "c1": "false" , "d1" : "2015-01-03"}
      { "a1": 0, "b1": 0, "c1": "false" , "d1" : "2015-01-04"}
      { "a1": 1, "b1": 1, "c1": "true" ,  "d1" : "2015-01-05"}
      { "a1": 1, "b1": 1, "c1": "true" ,  "d1" : "2015-01-06"}
      

      Query below returns wrong result, we should have 4 groups, instead we get only two groups from right side of the union.
      Notice lack of project step above union all (unclear if it has anything to to with the bug)

      0: jdbc:drill:schema=dfs> select calc1, max(a1), min(b1), count(c1) from (select a1+10 as calc1, a1, b1, c1 from `t.json` union all select a1+100 as calc1, a1, b1, c1 from `t.json`) group by calc1;
      +------------+------------+------------+------------+
      |   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
      +------------+------------+------------+------------+
      | 100        | 0          | 0          | 3          |
      | 101        | 1          | 1          | 2          |
      +------------+------------+------------+------------+
      2 rows selected (0.181 seconds)
      
      00-01      Project(calc1=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
      00-02        HashAgg(group=[{0}], EXPR$1=[MAX($1)], EXPR$2=[MIN($2)], EXPR$3=[COUNT($3)])
      00-03          UnionAll(all=[true])
      00-05            Project(calc1=[+($2, 10)], a1=[$2], b1=[$1], c1=[$0])
      00-07              Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]])
      00-04            Project(calc1=[+($2, 100)], a1=[$2], b1=[$1], c1=[$0])
      00-06              Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]])
      

      Correct result, when order of the columns is different in aggregate functions: "min(b1), max(a1)" instead of "max(a1), min(b1)"

      0: jdbc:drill:schema=dfs> select calc1, min(b1), max(a1), count(c1) from (select a1+10 as calc1, a1, b1, c1 from `t.json` union all select a1+100 as calc1, a1, b1, c1 from `t.json`) group by calc1;
      +------------+------------+------------+------------+
      |   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
      +------------+------------+------------+------------+
      | 10         | 0          | 0          | 3          |
      | 11         | 1          | 1          | 2          |
      | 100        | 0          | 0          | 3          |
      | 101        | 1          | 1          | 2          |
      +------------+------------+------------+------------+
      4 rows selected (0.256 seconds)
      
      00-01      Project(calc1=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
      00-02        HashAgg(group=[{0}], EXPR$1=[MIN($1)], EXPR$2=[MAX($2)], EXPR$3=[COUNT($3)])
      00-03          Project(calc1=[$0], b1=[$2], a1=[$1], c1=[$3])
      00-04            UnionAll(all=[true])
      00-06              Project(calc1=[+($2, 10)], a1=[$2], b1=[$1], c1=[$0])
      00-08                Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]])
      00-05              Project(calc1=[+($2, 100)], a1=[$2], b1=[$1], c1=[$0])
      00-07                Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]])

      Non-union query works correctly:

      0: jdbc:drill:schema=dfs> select a1+10 as calc1, min(a1), max(b1), count(c1) from `t.json` group by a1+10;
      +------------+------------+------------+------------+
      |   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
      +------------+------------+------------+------------+
      | 10         | 0          | 0          | 3          |
      | 11         | 1          | 1          | 2          |
      +------------+------------+------------+------------+
      2 rows selected (0.142 seconds)
      0: jdbc:drill:schema=dfs> select a1+10 as calc1, min(b1), max(a1), count(c1) from `t.json` group by a1+10;
      +------------+------------+------------+------------+
      |   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
      +------------+------------+------------+------------+
      | 10         | 0          | 0          | 3          |
      | 11         | 1          | 1          | 2          |
      +------------+------------+------------+------------+
      2 rows selected (0.153 seconds)
      

      If I switch legs of a UNION, we still get wrong result, but from the right side:

      0: jdbc:drill:schema=dfs> select calc1, max(a1), min(b1), count(c1) from (select a1+100 as calc1, a1, b1, c1 from `t.json` union all select a1+10 as calc1, a1, b1, c1 from `t.json`) group by calc1;
      +------------+------------+------------+------------+
      |   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
      +------------+------------+------------+------------+
      | 10         | 0          | 0          | 3          |
      | 11         | 1          | 1          | 2          |
      +------------+------------+------------+------------+
      2 rows selected (0.19 seconds)
      

      Attachments

        Issue Links

          Activity

            People

              seanhychu Sean Hsuan-Yi Chu
              vicky Victoria Markman
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: