  1. Apache Drill
  2. DRILL-3884

Hive native scan has lower parallelization leading to performance degradation

    Details

      Description

      HiveDrillNativeParquetScan.getScanStats() currently divides the rowCount obtained from HiveScan by a factor and returns the result as its cost. The problem is that all cost calculations and parallelization decisions depend on rowCount; cpuCost is not taken into account by the current cost calculation in ScanPrel. For the planner to choose HiveDrillNativeParquetScan over HiveScan, the former must report a lower rowCount, but a lower rowCount also lowers parallelization and degrades performance.
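The effect described above can be illustrated with a minimal, self-contained sketch. It does not use Drill's real classes: `RowCountParallelizationSketch`, `width`, `scaledRowCount`, the slice target of 100,000 rows, the width cap of 16, and the scaling factor of 10 are all hypothetical values chosen for illustration. The only point is that when fragment width is derived from rowCount, dividing rowCount to look cheaper also shrinks the width.

```java
// Hypothetical sketch; none of these names are Drill's actual API.
final class RowCountParallelizationSketch {

    // Assumed rule: one minor fragment per `sliceTarget` rows,
    // at least 1 and at most `maxWidth`.
    static int width(double rowCount, long sliceTarget, int maxWidth) {
        int wanted = (int) Math.ceil(rowCount / sliceTarget);
        return Math.max(1, Math.min(wanted, maxWidth));
    }

    // The native scan reports a reduced row count so that it looks
    // cheaper than the SerDe-based scan. The factor is an assumption.
    static double scaledRowCount(double hiveRowCount, double factor) {
        return hiveRowCount / factor;
    }

    public static void main(String[] args) {
        double hiveRows = 1_000_000;   // assumed table size
        long sliceTarget = 100_000;    // assumed rows per fragment
        int maxWidth = 16;             // assumed upper bound on width
        double nativeRows = scaledRowCount(hiveRows, 10.0);

        System.out.println("HiveScan width:    " + width(hiveRows, sliceTarget, maxWidth));
        System.out.println("Native scan width: " + width(nativeRows, sliceTarget, maxWidth));
    }
}
```

Under these assumed numbers the SerDe scan would be planned with a width of 10 and the native scan with a width of 1, even though both read the same data.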

      Temporary fix for Drill 1.2, until DRILL-3856 fully resolves the handling of CPU cost in the cost model:
      1. Change ScanPrel to consider the CPU cost in the stats supplied by the GroupScan
      2. Assign a higher CPU cost to HiveScan (SerDe route)
      3. Assign a lower CPU cost to HiveDrillNativeParquetScan
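The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `CpuAwareScanCostSketch`, `Stats`, the per-row CPU constants, and the comparison rule (row count first, CPU cost as tie-breaker) are all hypothetical. The idea shown is that both scans report the same rowCount, so parallelization is unaffected, while the CPU cost alone makes the native Parquet scan the cheaper choice.

```java
// Hypothetical sketch; names and constants are illustrative only.
final class CpuAwareScanCostSketch {

    // Minimal stand-in for the statistics a group scan reports.
    static final class Stats {
        final double rowCount;
        final double cpuCost;

        Stats(double rowCount, double cpuCost) {
            this.rowCount = rowCount;
            this.cpuCost = cpuCost;
        }
    }

    // Assumed per-row CPU weights: the SerDe route is costlier.
    static final double SERDE_CPU_PER_ROW = 2.0;
    static final double NATIVE_CPU_PER_ROW = 1.0;

    // Steps 2 and 3: same row count, different CPU cost.
    static Stats hiveScanStats(double rows) {
        return new Stats(rows, rows * SERDE_CPU_PER_ROW);
    }

    static Stats nativeParquetScanStats(double rows) {
        return new Stats(rows, rows * NATIVE_CPU_PER_ROW);
    }

    // Behaviour described in the issue: only rowCount is compared.
    static boolean cheaperByRowsOnly(Stats a, Stats b) {
        return a.rowCount < b.rowCount;
    }

    // Step 1: also consider CPU cost (here as a tie-breaker).
    static boolean cheaperWithCpu(Stats a, Stats b) {
        if (a.rowCount != b.rowCount) {
            return a.rowCount < b.rowCount;
        }
        return a.cpuCost < b.cpuCost;
    }

    public static void main(String[] args) {
        Stats hive = hiveScanStats(1_000_000);
        Stats nativeScan = nativeParquetScanStats(1_000_000);

        System.out.println("rows only prefers native: " + cheaperByRowsOnly(nativeScan, hive));
        System.out.println("with CPU prefers native:  " + cheaperWithCpu(nativeScan, hive));
    }
}
```

With equal row counts, the rows-only comparison cannot distinguish the two scans, whereas the CPU-aware comparison prefers the native scan without reducing the row count that drives parallelization.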

            People

            • Assignee:
              cchang@maprtech.com Chun Chang
              Reporter:
              vkorukanti Venki Korukanti