  1. Hive
  2. HIVE-8120 Umbrella JIRA tracking Parquet improvements
  3. HIVE-11763

Use select * instead of sum(hash(*)) in Parquet predicate push down (PPD) integration tests

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels:
      None

      Description

      The integration tests for Parquet predicate push down (PPD) use the following query to validate the values filtered:

      select sum(hash(*)) from ...
      

      It would be better to use select * from ... instead, so that the filtered values can be checked directly. With a hash sum, it is difficult to tell whether a particular value was filtered out.
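
      As a rough illustration of the difference, the two styles of check might look like this. The column name b and the predicate are assumptions made for the example; they are not taken from the test file.

      ```sql
      -- Current style: the result is a single number, so a wrongly
      -- filtered row only shows up as a different, unreadable sum.
      select sum(hash(*)) from newtypestbl where b = true;

      -- Proposed style: the surviving rows are printed, so it is
      -- visible which values passed the predicate.
      -- (b is a hypothetical boolean column used for illustration.)
      select * from newtypestbl where b = true;
      ```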

      Also, we can limit the number of rows produced by the INSERT ... SELECT statement to avoid displaying many rows when validating the data. I think a LIMIT 2 on each SELECT would be enough.

      For example, the parquet_ppd_boolean.ppd has this:

      insert overwrite table newtypestbl select * from (select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, true from src src1 union all select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, false from src src2) uniontbl;
      

      If we use LIMIT 2 on each SELECT, the table ends up with at most four rows:

      insert overwrite table newtypestbl select * from (select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, true from src src1 LIMIT 2 union all select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, false from src src2 LIMIT 2) uniontbl;
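
      Depending on the Hive version, a LIMIT written directly before UNION ALL may not be accepted, or may not apply to the individual SELECT. A possible variant, sketched under that assumption, wraps each branch in its own subquery so that the limit clearly belongs to that branch. The aliases s1 and s2 are made up for the example.

      ```sql
      insert overwrite table newtypestbl
      select * from (
        -- first branch, limited to 2 rows inside its own subquery
        select * from (
          select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, true
          from src src1 limit 2
        ) s1
        union all
        -- second branch, limited to 2 rows inside its own subquery
        select * from (
          select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, false
          from src src2 limit 2
        ) s2
      ) uniontbl;
      ```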
      

            People

            • Assignee:
              spena Sergio Peña
              Reporter:
              spena Sergio Peña