  1. Apache Drill
  2. DRILL-4324

Hive native reader is slow when the underlying Parquet file has many row groups


Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved

    Description

      git.commit.id.abbrev=3d0b4b0

      TPCDS Query 84:

      SELECT c_customer_id   AS customer_id, 
                     c_last_name 
                     || ', ' 
                     || c_first_name AS customername 
      FROM   customer, 
             customer_address, 
             customer_demographics, 
             household_demographics, 
             income_band, 
             store_returns 
      WHERE  ca_city = 'Green Acres' 
             AND c_current_addr_sk = ca_address_sk 
             AND ib_lower_bound >= 54986 
             AND ib_upper_bound <= 54986 + 50000 
             AND ib_income_band_sk = hd_income_band_sk 
             AND cd_demo_sk = c_current_cdemo_sk 
             AND hd_demo_sk = c_current_hdemo_sk 
             AND sr_cdemo_sk = cd_demo_sk 
      ORDER  BY c_customer_id
      LIMIT 100;
      

      Execution times :

      Hive Plugin : 12.34 seconds
      Hive Native Reader : 360.866 seconds
      DFS Parquet Reader : 84.3 seconds
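
      For anyone reproducing the comparison, the two Hive-based runs can probably be switched with a session option. This is only a sketch: the plugin name `hive` is an assumption about the local setup, and the option name shown is the one used by Drill releases from around this commit, so later versions may name it differently. Rerun Query 84 after each setting.

      ```sql
      -- Assumption: the Hive storage plugin is registered under the name `hive`.
      USE hive;

      -- Hive Plugin run: scan the tables through the regular Hive reader.
      ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = false;

      -- Hive Native Reader run: let Drill scan the Hive Parquet tables
      -- with its own Parquet reader instead.
      ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = true;
      ```

      The DFS Parquet Reader run queries the same Parquet files directly through a file system storage plugin rather than through Hive, so it needs no session option.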
      

      Note: These data sets were generated by Hive, and the underlying Parquet files contain more than one row group (household_demographics has ~8000 row groups).

      The data files are larger than 10 MB, so they cannot be attached here. Reach out to me if you need anything else.


            People

              Assignee: Unassigned
              Reporter: Rahul Kumar Challapalli (rkins)