  1. Apache Drill
  2. DRILL-3884

Hive native scan has lower parallelization leading to performance degradation


Details

    Description

      Currently, HiveDrillNativeParquetScan.getScanStats() divides the rowCount obtained from HiveScan by a factor and returns the result as its cost. The problem is that all cost calculations and parallelization decisions depend on rowCount; cpuCost is not taken into account by the current cost calculations in ScanPrel. For the planner to choose HiveDrillNativeParquetScan over HiveScan, rowCount has to be lowered for the former, but this leads to lower parallelization and performance degradation.
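
To make the effect concrete, here is a minimal sketch of a parallelization width that is derived only from the estimated row count. It is not Drill's planner code: the class name, the `width` helper, the rows-per-fragment constant, and the discount factor of 10 are all assumptions chosen for illustration.

```java
// Hypothetical sketch; names and constants are illustrative, not Drill's API.
final class RowCountParallelismSketch {
    // Assumed number of estimated rows that one scan fragment should handle.
    static final long ROWS_PER_FRAGMENT = 100_000L;

    /** Parallelization width derived only from the estimated row count. */
    static int width(long estimatedRows, int maxWidth) {
        // Ceiling division: fragments needed to cover all estimated rows.
        long wanted = (estimatedRows + ROWS_PER_FRAGMENT - 1) / ROWS_PER_FRAGMENT;
        // Clamp to at least one fragment and at most maxWidth fragments.
        return (int) Math.max(1, Math.min(maxWidth, wanted));
    }

    public static void main(String[] args) {
        long actualRows = 10_000_000L;
        long discountFactor = 10; // assumed factor applied to the native scan's rowCount
        System.out.println("SerDe scan width:  " + width(actualRows, 64));
        System.out.println("native scan width: " + width(actualRows / discountFactor, 64));
    }
}
```

Under these assumptions, the same table is scanned with far fewer fragments once its reported rowCount is divided by the factor, which is the degradation described above.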

      Temporary fix for Drill 1.2, until DRILL-3856 fully resolves how CPU cost is considered in the cost model:
      1. Change ScanPrel to consider the CPU cost in the stats that the GroupScan returns
      2. Assign a higher CPU cost to HiveScan (SerDe route)
      3. Assign a lower CPU cost to HiveDrillNativeParquetScan
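
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual patch: the `Stats` class, the per-row CPU weights, and the `cheaper` comparison are hypothetical stand-ins for whatever ScanPrel and the two group scans really use.

```java
// Hypothetical sketch of a CPU-aware scan comparison; not Drill's real classes.
final class CpuAwareScanCostSketch {
    /** Minimal stand-in for the stats a group scan reports to the planner. */
    static final class Stats {
        final double rowCount;
        final double cpuCost;

        Stats(double rowCount, double cpuCost) {
            this.rowCount = rowCount;
            this.cpuCost = cpuCost;
        }
    }

    // Assumed per-row CPU weights: the SerDe route is treated as costlier.
    static final double SERDE_CPU_PER_ROW = 2.0;
    static final double NATIVE_CPU_PER_ROW = 1.0;

    /** SerDe-based scan: true row count, higher CPU cost. */
    static Stats serdeScan(double rows) {
        return new Stats(rows, rows * SERDE_CPU_PER_ROW);
    }

    /** Native Parquet scan: same true row count, lower CPU cost. */
    static Stats nativeScan(double rows) {
        return new Stats(rows, rows * NATIVE_CPU_PER_ROW);
    }

    /** Plan choice driven by CPU cost, so rowCount no longer has to be distorted. */
    static boolean cheaper(Stats a, Stats b) {
        return a.cpuCost < b.cpuCost;
    }

    public static void main(String[] args) {
        Stats serde = serdeScan(10_000_000);
        Stats nativeParquet = nativeScan(10_000_000);
        System.out.println("native preferred: " + cheaper(nativeParquet, serde));
        System.out.println("row counts equal: " + (nativeParquet.rowCount == serde.rowCount));
    }
}
```

The design point is that the preference for the native reader is expressed through cpuCost alone, so both scans can report the same rowCount and any row-count-based parallelization stays unchanged.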

          People

            cchang@maprtech.com Chun Chang
            vkorukanti Venki Korukanti