Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-2488

Wrong result on join between two subqueries with aggregation

    XMLWordPrintableJSON

Details

    Description

      0: jdbc:drill:schema=dfs> select * from t1;
      +------------+------------+------------+
      |     a1     |     b1     |     c1     |
      +------------+------------+------------+
      | 1          | aaaaa      | 2015-01-01 |
      | 2          | bbbbb      | 2015-01-02 |
      | 3          | ccccc      | 2015-01-03 |
      | 4          | null       | 2015-01-04 |
      | 5          | eeeee      | 2015-01-05 |
      | 6          | fffff      | 2015-01-06 |
      | 7          | ggggg      | 2015-01-07 |
      | null       | hhhhh      | 2015-01-08 |
      | 9          | iiiii      | null       |
      | 10         | jjjjj      | 2015-01-10 |
      +------------+------------+------------+
      10 rows selected (0.15 seconds)
      

      This result is incorrect, one row is missing

      0: jdbc:drill:schema=dfs> select * from
      . . . . . . . . . . . . > (
      . . . . . . . . . . . . >                 select
      . . . . . . . . . . . . >                         b1,
      . . . . . . . . . . . . >                         count(distinct a1)
      . . . . . . . . . . . . >                 from
      . . . . . . . . . . . . >                         t1
      . . . . . . . . . . . . >                 group by
      . . . . . . . . . . . . >                         b1
      . . . . . . . . . . . . >                 order by
      . . . . . . . . . . . . >                         b1 limit 5 offset 1
      . . . . . . . . . . . . >         ) as sq1(x1, y1)
      . . . . . . . . . . . . >
      . . . . . . . . . . . . >         inner join
      . . . . . . . . . . . . >
      . . . . . . . . . . . . >         (
      . . . . . . . . . . . . >                 select
      . . . . . . . . . . . . >                         b1,
      . . . . . . . . . . . . >                         count(distinct a1)
      . . . . . . . . . . . . >                 from
      . . . . . . . . . . . . >                         t1
      . . . . . . . . . . . . >                 group by
      . . . . . . . . . . . . >                         b1
      . . . . . . . . . . . . >                 order by
      . . . . . . . . . . . . >                         b1 limit 5 offset 1
      . . . . . . . . . . . . >         ) as sq2(x1, y1)
      . . . . . . . . . . . . >         on
      . . . . . . . . . . . . >                 sq1.x1 = sq2.x1 and
      . . . . . . . . . . . . >                 sq2.y1 = sq2.y1
      . . . . . . . . . . . . > ;
      +------------+------------+------------+------------+
      |     x1     |     y1     |    x10     |    y10     |
      +------------+------------+------------+------------+
      | bbbbb      | 1          | bbbbb      | 1          |
      | ccccc      | 1          | ccccc      | 1          |
      | eeeee      | 1          | eeeee      | 1          |
      | fffff      | 1          | fffff      | 1          |
      +------------+------------+------------+------------+
      4 rows selected (0.28 seconds)
      

      Explain plan for the wrong result:

      00-01      Project(x1=[$0], y1=[$1], x10=[$2], y10=[$3])
      00-02        Project(x1=[$0], y1=[$1], x10=[$2], y10=[$3])
      00-03          MergeJoin(condition=[=($0, $2)], joinType=[inner])
      00-05            Limit(offset=[1], fetch=[5])
      00-07              StreamAgg(group=[{0}], EXPR$1=[COUNT($1)])
      00-09                Sort(sort0=[$0], dir0=[ASC])
      00-11                  StreamAgg(group=[{0, 1}])
      00-13                    Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
      00-15                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/aggregation/t1]], selectionRoot=/drill/testdata/aggregation/t1, numFiles=1, columns=[`b1`, `a1`]]])
      00-04            Project(b10=[$0], EXPR$10=[$1])
      00-06              SelectionVectorRemover
      00-08                Sort(sort0=[$0], dir0=[ASC])
      00-10                  Filter(condition=[=($1, $1)])
      00-12                    Limit(offset=[1], fetch=[5])
      00-14                      StreamAgg(group=[{0}], EXPR$1=[COUNT($1)])
      00-16                        Sort(sort0=[$0], dir0=[ASC])
      00-17                          StreamAgg(group=[{0, 1}])
      00-18                            Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
      00-19                              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/aggregation/t1]], selectionRoot=/drill/testdata/aggregation/t1, numFiles=1, columns=[`b1`, `a1`]]])
      

      If you turn off merge join, query returns correct result:

      0: jdbc:drill:schema=dfs> select * from
      . . . . . . . . . . . . > (
      . . . . . . . . . . . . >     select
      . . . . . . . . . . . . >         b1,
      . . . . . . . . . . . . >         count(distinct a1)
      . . . . . . . . . . . . >     from
      . . . . . . . . . . . . >         t1
      . . . . . . . . . . . . >     group by
      . . . . . . . . . . . . >         b1
      . . . . . . . . . . . . >     order by
      . . . . . . . . . . . . >         b1 limit 5 offset 1
      . . . . . . . . . . . . >  ) as sq1(x1, y1)
      . . . . . . . . . . . . >
      . . . . . . . . . . . . >         inner join
      . . . . . . . . . . . . > (
      . . . . . . . . . . . . >     select
      . . . . . . . . . . . . >         b1,
      . . . . . . . . . . . . >         count(distinct a1)
      . . . . . . . . . . . . >     from
      . . . . . . . . . . . . >         t1
      . . . . . . . . . . . . >     group by
      . . . . . . . . . . . . >         b1
      . . . . . . . . . . . . >     order by
      . . . . . . . . . . . . >           b1 limit 5 offset 1
      . . . . . . . . . . . . > ) as sq2(x1, y1)
      . . . . . . . . . . . . > on
      . . . . . . . . . . . . >                 sq1.x1 = sq2.x1 and
      . . . . . . . . . . . . >                 sq2.y1 = sq2.y1
      . . . . . . . . . . . . > ;
      +------------+------------+------------+------------+
      |     x1     |     y1     |    x10     |    y10     |
      +------------+------------+------------+------------+
      | bbbbb      | 1          | bbbbb      | 1          |
      | ccccc      | 1          | ccccc      | 1          |
      | eeeee      | 1          | eeeee      | 1          |
      | fffff      | 1          | fffff      | 1          |
      | ggggg      | 1          | ggggg      | 1          |
      +------------+------------+------------+------------+
      5 rows selected (0.352 seconds)
      

      cut/paste reproduction

      select * from
       (
                       select 
                               b1,                      
                               count(distinct a1)       
                       from 
                               t1                       
                       group by                 
                               b1                       
                       order by                 
                               b1 limit 5 offset 1      
               ) as sq1(x1, y1)         
        
               inner join 
        
               (
                       select 
                               b1,                      
                               count(distinct a1)       
                       from 
                               t1                       
                       group by                 
                               b1                       
                       order by                 
                               b1 limit 5 offset 1      
               ) as sq2(x1, y1)         
               on 
                       sq1.x1 = sq2.x1 and      
                       sq2.y1 = sq2.y1          
       ;
      

      This test has been running and passing in regression test suite until framework was switched to a new code, where JSON parsing was replaced with jackson and for a brief period ( I believe Friday afternoon until now ) this suite was not executed.
      We already have a merge join bug DRILL-2010, but this one seems to be different (small data set) and feels like a recent regression.

      For QA: test Functional/Passing/aggregation/sanity/q18.sql will be running with merge join disabled until this issue is fixed. Will need to remove alter session from this file.

      Attachments

        1. 0001-DRILL-2488-Return-DEFAULT-as-supported-encoding-for-.patch
          1 kB
          Aman Sinha
        2. t1.parquet
          0.5 kB
          Victoria Markman

        Activity

          People

            amansinha100 Aman Sinha
            vicky Victoria Markman
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: