Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
1.2.0
None
Description
Currently, HiveDrillNativeParquetScan.getScanStats() divides the rowCount obtained from HiveScan by a factor and returns the result as its cost. The problem is that all cost calculations and parallelization decisions depend on rowCount, while cpuCost is not taken into account by the current cost calculation in ScanPrel. For the planner to choose HiveDrillNativeParquetScan over HiveScan, the rowCount of the former has to be lowered, but this reduces parallelization and degrades performance.
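To illustrate the side effect, here is a minimal sketch under a simplified assumption: the planner requests one fragment per `sliceTarget` rows, capped at `maxWidth`. The class name, the `parallelWidth` helper, the scaling factor of 10, and all numbers are hypothetical and are not taken from Drill's code.

```java
public class RowCountScalingSketch {

    // Hypothetical simplification: one fragment per sliceTarget rows,
    // at least 1 and at most maxWidth fragments.
    static int parallelWidth(double rowCount, long sliceTarget, int maxWidth) {
        int wanted = (int) Math.ceil(rowCount / sliceTarget);
        return Math.max(1, Math.min(maxWidth, wanted));
    }

    public static void main(String[] args) {
        double actualRows = 1_000_000;  // rows reported by the underlying scan (made-up value)
        double factor = 10;             // hypothetical scaling factor applied to rowCount
        long sliceTarget = 100_000;     // hypothetical rows-per-fragment threshold
        int maxWidth = 16;              // hypothetical cap on fragments

        // Width derived from the real row count.
        System.out.println(parallelWidth(actualRows, sliceTarget, maxWidth));
        // Width derived from the artificially lowered row count.
        System.out.println(parallelWidth(actualRows / factor, sliceTarget, maxWidth));
    }
}
```

In this model the scaled-down estimate makes the scan look cheaper, but it also shrinks the fragment count for the same amount of data.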
Temporary fix for Drill 1.2, until DRILL-3856 fully incorporates CPU cost into the cost model:
1. Change ScanPrel to consider the CPU cost in the Stats provided by GroupScan
2. Assign a higher CPU cost to HiveScan (SerDe route)
3. Assign a lower CPU cost to HiveDrillNativeParquetScan
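The three steps above can be sketched roughly as follows. This is not the actual patch: the `ScanStats` class here is a stand-in, the per-row CPU weights are invented, and the `isCheaper` comparison (row count first, CPU cost as the tie-breaker) is an assumed simplification of how a planner might weigh the two values.

```java
public class ScanCostSketch {

    // Stand-in for the stats a group scan reports; not Drill's real class.
    static final class ScanStats {
        final double rowCount;
        final double cpuCost;

        ScanStats(double rowCount, double cpuCost) {
            this.rowCount = rowCount;
            this.cpuCost = cpuCost;
        }
    }

    // Hypothetical per-row CPU weights: the SerDe route is assumed costlier.
    static final double SERDE_CPU_PER_ROW = 2.0;
    static final double NATIVE_CPU_PER_ROW = 1.0;

    // Both scans report the same, unscaled row count; only CPU cost differs.
    static ScanStats hiveScanStats(double rows) {
        return new ScanStats(rows, rows * SERDE_CPU_PER_ROW);
    }

    static ScanStats nativeParquetStats(double rows) {
        return new ScanStats(rows, rows * NATIVE_CPU_PER_ROW);
    }

    // Assumed comparison: fewer rows wins; on equal rows, lower CPU cost wins.
    static boolean isCheaper(ScanStats a, ScanStats b) {
        if (a.rowCount != b.rowCount) {
            return a.rowCount < b.rowCount;
        }
        return a.cpuCost < b.cpuCost;
    }

    public static void main(String[] args) {
        ScanStats hive = hiveScanStats(1_000_000);
        ScanStats nativeScan = nativeParquetStats(1_000_000);
        System.out.println("native cheaper than SerDe: " + isCheaper(nativeScan, hive));
        System.out.println("row counts equal: " + (nativeScan.rowCount == hive.rowCount));
    }
}
```

Because the row count is left untouched in this sketch, anything derived from it (such as parallelization width) stays the same for both scans, while the CPU cost alone steers the choice toward the native reader.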