[DRILL-4324] Hive native reader is slow when the underlying parquet file has more row groups - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Storage - Hive, Storage - Parquet
Labels: None

Description

git.commit.id.abbrev=3d0b4b0

TPCDS Query 84:

SELECT c_customer_id   AS customer_id, 
               c_last_name 
               || ', ' 
               || c_first_name AS customername 
FROM   customer, 
       customer_address, 
       customer_demographics, 
       household_demographics, 
       income_band, 
       store_returns 
WHERE  ca_city = 'Green Acres' 
       AND c_current_addr_sk = ca_address_sk 
       AND ib_lower_bound >= 54986 
       AND ib_upper_bound <= 54986 + 50000 
       AND ib_income_band_sk = hd_income_band_sk 
       AND cd_demo_sk = c_current_cdemo_sk 
       AND hd_demo_sk = c_current_hdemo_sk 
       AND sr_cdemo_sk = cd_demo_sk 
ORDER  BY c_customer_id
LIMIT 100;

Execution times:

Hive Plugin: 12.34 seconds
Hive Native Reader: 360.866 seconds
DFS Parquet Reader: 84.3 seconds
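
The Hive Plugin and Hive Native Reader timings presumably differ only in the session option named in the linked DRILL-4309. A minimal sketch of toggling it in Drill follows; the report does not show the commands actually used, so treat the exact steps as an assumption:

```sql
-- Assumption: the two Hive timings were taken by flipping this session option
-- between runs of the same query; the report does not state this explicitly.

-- Read Hive tables through the Hive plugin's own reader ("Hive Plugin" timing).
ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = false;

-- Read Hive Parquet tables through Drill's native Parquet reader
-- ("Hive Native Reader" timing).
ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = true;
```

Re-running TPCDS Query 84 after each statement should give one timing per reader, which makes the slowdown easy to compare on the same data.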

Note: These data sets were generated by Hive, and the underlying Parquet files have more than one row group (household_demographics has ~8000 row groups).

The data files are larger than 10 MB, so they cannot be attached here. Reach out to me if you need anything else.

Issue Links

blocks: DRILL-4309 Make this option store.hive.optimize_scan_with_native_readers=true default (Open)

People

Assignee: Unassigned

Reporter: Rahul Kumar Challapalli

Votes: 0

Watchers: 1

Dates

Created: 28/Jan/16 23:10

Updated: 04/Jan/18 15:10