IMPALA-5038: File size mismatch in PlannerTest.testPredicatePropagation

    Details

      Description

      A recent commit broke PlannerTest.testPredicatePropagation with a file size mismatch. We already added code to ignore file size mismatches in IMPALA-2565, but that code needs to be generalized to also handle differences in the unit, e.g., "B" vs. "KB". See the actual and expected results:

      Section PLAN of query:
      SELECT count(*) FROM
       (SELECT * from tpch_parquet.customer c CROSS JOIN tpch_parquet.nation n
        WHERE n_name = 'BRAZIL' AND n_regionkey = 1 AND c_custkey % 2 = 0) cn
       LEFT OUTER JOIN tpch_parquet.region r ON n_regionkey = r_regionkey
      
      Actual does not match expected result:
      PLAN-ROOT SINK
      |
      05:AGGREGATE [FINALIZE]
      |  output: count(*)
      |
      04:HASH JOIN [LEFT OUTER JOIN]
      |  hash predicates: n.n_regionkey = r_regionkey
      |
      |--03:SCAN HDFS [tpch_parquet.region r]
      |     partitions=1/1 files=1 size=939B
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      |     predicates: r.r_regionkey = 1
      |
      02:NESTED LOOP JOIN [CROSS JOIN]
      |
      |--01:SCAN HDFS [tpch_parquet.nation n]
      |     partitions=1/1 files=1 size=2.25KB
      |     predicates: n_regionkey = 1, n_name = 'BRAZIL'
      |
      00:SCAN HDFS [tpch_parquet.customer c]
         partitions=1/1 files=1 size=12.27MB
         predicates: c_custkey % 2 = 0
      
      Expected:
      PLAN-ROOT SINK
      |
      05:AGGREGATE [FINALIZE]
      |  output: count(*)
      |
      04:HASH JOIN [LEFT OUTER JOIN]
      |  hash predicates: n.n_regionkey = r_regionkey
      |
      |--03:SCAN HDFS [tpch_parquet.region r]
      |     partitions=1/1 files=1 size=1.01KB
      |     predicates: r.r_regionkey = 1
      |
      02:NESTED LOOP JOIN [CROSS JOIN]
      |
      |--01:SCAN HDFS [tpch_parquet.nation n]
      |     partitions=1/1 files=1 size=2.38KB
      |     predicates: n_regionkey = 1, n_name = 'BRAZIL'
      |
      00:SCAN HDFS [tpch_parquet.customer c]
         partitions=1/1 files=1 size=12.27MB
         predicates: c_custkey % 2 = 0
      
      Verbose plan:
      F00:PLAN FRAGMENT [UNPARTITIONED]
        PLAN-ROOT SINK
        |
        05:AGGREGATE [FINALIZE]
        |  output: count(*)
        |  hosts=1 per-host-mem=unavailable
        |  tuple-ids=4 row-size=8B cardinality=1
        |
        04:HASH JOIN [LEFT OUTER JOIN]
        |  hash predicates: n.n_regionkey = r_regionkey
        |  hosts=1 per-host-mem=unavailable
        |  tuple-ids=0,1,3N row-size=35B cardinality=15000
        |
        |--03:SCAN HDFS [tpch_parquet.region r]
        |     partitions=1/1 files=1 size=939B
        |     predicates: r.r_regionkey = 1
        |     table stats: 5 rows total
        |     column stats: all
        |     parquet statistics predicates: r.r_regionkey = 1
        |     parquet dictionary predicates: r.r_regionkey = 1
        |     hosts=1 per-host-mem=unavailable
        |     tuple-ids=3 row-size=2B cardinality=1
        |
        02:NESTED LOOP JOIN [CROSS JOIN]
        |  hosts=1 per-host-mem=unavailable
        |  tuple-ids=0,1 row-size=33B cardinality=15000
        |
        |--01:SCAN HDFS [tpch_parquet.nation n]
        |     partitions=1/1 files=1 size=2.25KB
        |     predicates: n_regionkey = 1, n_name = 'BRAZIL'
        |     table stats: 25 rows total
        |     column stats: all
        |     parquet statistics predicates: n_regionkey = 1, n_name = 'BRAZIL'
        |     parquet dictionary predicates: n_regionkey = 1, n_name = 'BRAZIL'
        |     hosts=1 per-host-mem=unavailable
        |     tuple-ids=1 row-size=25B cardinality=1
        |
        00:SCAN HDFS [tpch_parquet.customer c]
           partitions=1/1 files=1 size=12.27MB
           predicates: c_custkey % 2 = 0
           table stats: 150000 rows total
           column stats: all
           parquet dictionary predicates: c_custkey % 2 = 0
           hosts=1 per-host-mem=unavailable
           tuple-ids=0 row-size=8B cardinality=15000
      
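To illustrate why masking only the number is not enough, here is a minimal sketch in Python. The pattern `NUMBER_ONLY` is a hypothetical stand-in for the IMPALA-2565 logic (the actual test code is not shown in this issue). It masks only the digits after `size=`, so the unit suffix survives and the two scan lines above still differ.

```python
import re

# Hypothetical stand-in for the existing masking: the digits (and decimal
# point) after "size=" are removed, but the unit suffix is left in place.
NUMBER_ONLY = re.compile(r"size=[0-9.]+")


def mask_number_only(line: str) -> str:
    """Replace the numeric part of a size=... token with nothing."""
    return NUMBER_ONLY.sub("size=", line)


# The two mismatching lines from the actual and expected plans above.
actual = "|     partitions=1/1 files=1 size=939B"
expected = "|     partitions=1/1 files=1 size=1.01KB"

masked_actual = mask_number_only(actual)
masked_expected = mask_number_only(expected)

# The leftover units differ ("B" vs. "KB"), so the comparison still fails.
print(masked_actual)
print(masked_expected)
print(masked_actual == masked_expected)
```

Under this assumption, the 12.27MB customer scan would mask identically on both sides only because the unit happens to be the same.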

        Activity

        joemcdonnell Joe McDonnell added a comment -

        commit c4fb67c98c44c28f0c375b11a5ef996a2c52bbb2
        Author: Joe McDonnell <joemcdonnell@cloudera.com>
        Date: Tue Mar 7 09:15:14 2017 -0800

        IMPALA-5038: Fix file size regex to include bytes

        There is a regex to remove file sizes from test logs to avoid diffs
        caused by changes in file headers, etc. The regex was detecting
        size=number correctly, but it does not include the byte unit (i.e.
        size=900B is distinct from size=900KB). A file size change introduced
        by IMPALA-4624 caused a diff because the file size changed from 900B
        to 1.1KB. This fixes the regex to include the byte unit, so that
        changes of this form do not cause diffs.

        Change-Id: I5f7b8480065a4da63c8abaf53aae8aef772f0172
        Reviewed-on: http://gerrit.cloudera.org:8080/6288
        Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
        Tested-by: Impala Public Jenkins
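The idea in the commit message can be sketched as follows. This is an illustration in Python, not the code from the commit: the pattern `SIZE_WITH_UNIT`, its list of units, and the helper `strip_file_sizes` are assumptions made for this example. The lookbehind that skips `row-size=` tokens in verbose plans is likewise an addition of this sketch.

```python
import re

# Assumed pattern: "size=" followed by a number and a byte unit. The unit
# list is a guess at what plan output can contain. The lookbehind keeps
# tokens such as "row-size=8B" from being treated as file sizes.
SIZE_WITH_UNIT = re.compile(
    r"(?<![-\w])size=[0-9]+(?:\.[0-9]+)?(?:B|KB|MB|GB|TB|PB)\b"
)


def strip_file_sizes(plan: str) -> str:
    """Mask every size=<number><unit> token so file sizes cannot cause diffs."""
    return SIZE_WITH_UNIT.sub("size=", plan)


actual = "|     partitions=1/1 files=1 size=939B"
expected = "|     partitions=1/1 files=1 size=1.01KB"

# With the unit included in the match, both lines mask to the same text.
print(strip_file_sizes(actual))
print(strip_file_sizes(actual) == strip_file_sizes(expected))
```

Matching the unit as part of the token means a change from bytes to kilobytes is masked the same way as a change in the number alone.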


          People

          • Assignee: Joe McDonnell (joemcdonnell)
          • Reporter: Alexander Behm (alex.behm)