[DRILL-3884] Hive native scan has lower parallelization leading to performance degradation - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.2.0
Fix Version/s: 1.2.0
Component/s: Query Planning & Optimization, Storage - Hive
Labels: None

Description

Currently, HiveDrillNativeParquetScan.getScanStats() divides the rowCount obtained from HiveScan by a factor and returns that as the cost. The problem is that all cost calculations and parallelization decisions depend on rowCount; the cpuCost value is not taken into account by the current cost calculations in ScanPrel. For the planner to choose HiveDrillNativeParquetScan over HiveScan, rowCount therefore has to be lowered for the former, but the lowered rowCount also reduces parallelization and degrades performance.

Temporary fix for Drill 1.2, until DRILL-3856 fully resolves how CPU cost is considered in the cost model:
1. Change ScanPrel to consider the CPU cost in the stats provided by the GroupScan.
2. Assign a higher CPU cost to HiveScan (the SerDe route).
3. Assign a lower CPU cost to HiveDrillNativeParquetScan.
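
The effect of the three steps above can be illustrated with a minimal, self-contained sketch. It is not Drill's code: the Stats class, the ROWS_PER_SLICE constant, the per-row CPU weights, and the divide-by-10 factor are all hypothetical stand-ins chosen for illustration. The sketch assumes that scan width is derived from rowCount alone, and that plan choice compares rowCount multiplied by a per-row CPU cost.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; none of these names are Drill APIs.
final class ScanCostSketch {

    /** Hypothetical stand-in for the stats a group scan reports to the planner. */
    static final class Stats {
        final String name;
        final double rowCount;
        final double cpuCostPerRow;

        Stats(String name, double rowCount, double cpuCostPerRow) {
            this.name = name;
            this.rowCount = rowCount;
            this.cpuCostPerRow = cpuCostPerRow;
        }
    }

    // Assumed number of rows one parallel slice should handle.
    static final long ROWS_PER_SLICE = 100_000L;

    /** Width depends only on rowCount, so shrinking rowCount shrinks parallelism. */
    static int parallelism(double rowCount, int maxWidth) {
        long width = (long) Math.ceil(rowCount / ROWS_PER_SLICE);
        return (int) Math.max(1L, Math.min((long) maxWidth, width));
    }

    /** Cost that takes CPU into account: rows times per-row CPU weight. */
    static double totalCost(Stats s) {
        return s.rowCount * s.cpuCostPerRow;
    }

    /** Picks the candidate with the lowest CPU-aware cost. */
    static Stats cheapest(List<Stats> candidates) {
        return candidates.stream()
                .min(Comparator.comparingDouble(ScanCostSketch::totalCost))
                .get();
    }

    public static void main(String[] args) {
        double rows = 10_000_000d;

        // Before: the native scan wins only by under-reporting rowCount.
        double underReported = rows / 10; // hypothetical factor
        System.out.println("width with lowered rowCount: " + parallelism(underReported, 1000));

        // After: both scans report the true rowCount; CPU cost decides the winner.
        Stats serde = new Stats("HiveScan", rows, 2.0);                    // higher CPU weight
        Stats nativeScan = new Stats("HiveDrillNativeParquetScan", rows, 1.0); // lower CPU weight
        System.out.println("chosen scan: " + cheapest(Arrays.asList(serde, nativeScan)).name);
        System.out.println("width with true rowCount: " + parallelism(nativeScan.rowCount, 1000));
    }
}
```

Under these assumptions the native scan is still preferred, because its CPU-aware cost is lower, while its width is computed from the unmodified rowCount.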

People

Assignee: Chun Chang
Reporter: Venki Korukanti
Votes: 0
Watchers: 3

Dates

Created: 01/Oct/15 23:25
Updated: 06/Oct/15 16:54
Resolved: 02/Oct/15 20:18