Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-2488

Wrong result on join between two subqueries with aggregation

    XMLWordPrintableJSON

Details

    Description

      0: jdbc:drill:schema=dfs> select * from t1;
      +------------+------------+------------+
      |     a1     |     b1     |     c1     |
      +------------+------------+------------+
      | 1          | aaaaa      | 2015-01-01 |
      | 2          | bbbbb      | 2015-01-02 |
      | 3          | ccccc      | 2015-01-03 |
      | 4          | null       | 2015-01-04 |
      | 5          | eeeee      | 2015-01-05 |
      | 6          | fffff      | 2015-01-06 |
      | 7          | ggggg      | 2015-01-07 |
      | null       | hhhhh      | 2015-01-08 |
      | 9          | iiiii      | null       |
      | 10         | jjjjj      | 2015-01-10 |
      +------------+------------+------------+
      10 rows selected (0.15 seconds)
      

      This result is incorrect, one row is missing

      0: jdbc:drill:schema=dfs> select * from
      . . . . . . . . . . . . > (
      . . . . . . . . . . . . >                 select
      . . . . . . . . . . . . >                         b1,
      . . . . . . . . . . . . >                         count(distinct a1)
      . . . . . . . . . . . . >                 from
      . . . . . . . . . . . . >                         t1
      . . . . . . . . . . . . >                 group by
      . . . . . . . . . . . . >                         b1
      . . . . . . . . . . . . >                 order by
      . . . . . . . . . . . . >                         b1 limit 5 offset 1
      . . . . . . . . . . . . >         ) as sq1(x1, y1)
      . . . . . . . . . . . . >
      . . . . . . . . . . . . >         inner join
      . . . . . . . . . . . . >
      . . . . . . . . . . . . >         (
      . . . . . . . . . . . . >                 select
      . . . . . . . . . . . . >                         b1,
      . . . . . . . . . . . . >                         count(distinct a1)
      . . . . . . . . . . . . >                 from
      . . . . . . . . . . . . >                         t1
      . . . . . . . . . . . . >                 group by
      . . . . . . . . . . . . >                         b1
      . . . . . . . . . . . . >                 order by
      . . . . . . . . . . . . >                         b1 limit 5 offset 1
      . . . . . . . . . . . . >         ) as sq2(x1, y1)
      . . . . . . . . . . . . >         on
      . . . . . . . . . . . . >                 sq1.x1 = sq2.x1 and
      . . . . . . . . . . . . >                 sq2.y1 = sq2.y1
      . . . . . . . . . . . . > ;
      +------------+------------+------------+------------+
      |     x1     |     y1     |    x10     |    y10     |
      +------------+------------+------------+------------+
      | bbbbb      | 1          | bbbbb      | 1          |
      | ccccc      | 1          | ccccc      | 1          |
      | eeeee      | 1          | eeeee      | 1          |
      | fffff      | 1          | fffff      | 1          |
      +------------+------------+------------+------------+
      4 rows selected (0.28 seconds)
      

      Explain plan for the wrong result:

      00-01      Project(x1=[$0], y1=[$1], x10=[$2], y10=[$3])
      00-02        Project(x1=[$0], y1=[$1], x10=[$2], y10=[$3])
      00-03          MergeJoin(condition=[=($0, $2)], joinType=[inner])
      00-05            Limit(offset=[1], fetch=[5])
      00-07              StreamAgg(group=[{0}], EXPR$1=[COUNT($1)])
      00-09                Sort(sort0=[$0], dir0=[ASC])
      00-11                  StreamAgg(group=[{0, 1}])
      00-13                    Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
      00-15                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/aggregation/t1]], selectionRoot=/drill/testdata/aggregation/t1, numFiles=1, columns=[`b1`, `a1`]]])
      00-04            Project(b10=[$0], EXPR$10=[$1])
      00-06              SelectionVectorRemover
      00-08                Sort(sort0=[$0], dir0=[ASC])
      00-10                  Filter(condition=[=($1, $1)])
      00-12                    Limit(offset=[1], fetch=[5])
      00-14                      StreamAgg(group=[{0}], EXPR$1=[COUNT($1)])
      00-16                        Sort(sort0=[$0], dir0=[ASC])
      00-17                          StreamAgg(group=[{0, 1}])
      00-18                            Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
      00-19                              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/aggregation/t1]], selectionRoot=/drill/testdata/aggregation/t1, numFiles=1, columns=[`b1`, `a1`]]])
      

      If you turn off merge join, query returns correct result:

      0: jdbc:drill:schema=dfs> select * from
      . . . . . . . . . . . . > (
      . . . . . . . . . . . . >     select
      . . . . . . . . . . . . >         b1,
      . . . . . . . . . . . . >         count(distinct a1)
      . . . . . . . . . . . . >     from
      . . . . . . . . . . . . >         t1
      . . . . . . . . . . . . >     group by
      . . . . . . . . . . . . >         b1
      . . . . . . . . . . . . >     order by
      . . . . . . . . . . . . >         b1 limit 5 offset 1
      . . . . . . . . . . . . >  ) as sq1(x1, y1)
      . . . . . . . . . . . . >
      . . . . . . . . . . . . >         inner join
      . . . . . . . . . . . . > (
      . . . . . . . . . . . . >     select
      . . . . . . . . . . . . >         b1,
      . . . . . . . . . . . . >         count(distinct a1)
      . . . . . . . . . . . . >     from
      . . . . . . . . . . . . >         t1
      . . . . . . . . . . . . >     group by
      . . . . . . . . . . . . >         b1
      . . . . . . . . . . . . >     order by
      . . . . . . . . . . . . >           b1 limit 5 offset 1
      . . . . . . . . . . . . > ) as sq2(x1, y1)
      . . . . . . . . . . . . > on
      . . . . . . . . . . . . >                 sq1.x1 = sq2.x1 and
      . . . . . . . . . . . . >                 sq2.y1 = sq2.y1
      . . . . . . . . . . . . > ;
      +------------+------------+------------+------------+
      |     x1     |     y1     |    x10     |    y10     |
      +------------+------------+------------+------------+
      | bbbbb      | 1          | bbbbb      | 1          |
      | ccccc      | 1          | ccccc      | 1          |
      | eeeee      | 1          | eeeee      | 1          |
      | fffff      | 1          | fffff      | 1          |
      | ggggg      | 1          | ggggg      | 1          |
      +------------+------------+------------+------------+
      5 rows selected (0.352 seconds)
      

      cut/paste reproduction

      select * from
       (
                       select 
                               b1,                      
                               count(distinct a1)       
                       from 
                               t1                       
                       group by                 
                               b1                       
                       order by                 
                               b1 limit 5 offset 1      
               ) as sq1(x1, y1)         
        
               inner join 
        
               (
                       select 
                               b1,                      
                               count(distinct a1)       
                       from 
                               t1                       
                       group by                 
                               b1                       
                       order by                 
                               b1 limit 5 offset 1      
               ) as sq2(x1, y1)         
               on 
                       sq1.x1 = sq2.x1 and      
                       sq2.y1 = sq2.y1          
       ;
      

      This test has been running and passing in regression test suite until framework was switched to a new code, where JSON parsing was replaced with jackson and for a brief period ( I believe Friday afternoon until now ) this suite was not executed.
      We already have a merge join bug DRILL-2010, but this one seems to be different (small data set) and feels like a recent regression.

      For QA: test Functional/Passing/aggregation/sanity/q18.sql will be running with merge join disabled until this issue is fixed. Will need to remove alter session from this file.

      Attachments

        1. t1.parquet
          0.5 kB
          Victoria Markman
        2. 0001-DRILL-2488-Return-DEFAULT-as-supported-encoding-for-.patch
          1 kB
          Aman Sinha

        Activity

          People

            amansinha100 Aman Sinha
            vicky Victoria Markman
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: