  1. Spark
  2. SPARK-21781

Modify DataSourceScanExec to use concrete ColumnVector type.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels: None

      Description

      As mentioned at https://github.com/apache/spark/pull/18680#issuecomment-316820409, adding more ColumnVector implementations might (or might not) have significant performance implications, because the JIT compiler may stop inlining calls on ColumnVector or be forced to use virtual dispatch.

      On the read path, one of the major code paths is the code generated by ColumnarBatchScan. The generated code currently refers to the abstract ColumnVector type, so the penalty grows as more implementations are added. However, the concrete type is known from how the scan is used; for example, the vectorized Parquet reader uses OnHeapColumnVector. The generated code can use that concrete type directly to avoid the penalty.
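
The difference between the two styles of generated code can be illustrated with a minimal Java sketch. The classes below are simplified stand-ins written for this illustration, not Spark's actual ColumnVector hierarchy: `BufferColumnVector`, `sumAbstract`, and `sumConcrete` are hypothetical names, and the sketch makes no claim about how large the effect is in practice. It only shows that a loop typed against a final concrete class gives the JIT a single known call target, whereas a loop typed against the abstract class may see several implementations.

```java
import java.nio.ByteBuffer;

public class ConcreteTypeSketch {
    // Simplified stand-in for an abstract column vector (not Spark's real class).
    abstract static class ColumnVector {
        abstract int getInt(int rowId);
    }

    // Stand-in for an on-heap implementation backed by an int array.
    static final class OnHeapColumnVector extends ColumnVector {
        private final int[] data;
        OnHeapColumnVector(int[] data) { this.data = data; }
        @Override int getInt(int rowId) { return data[rowId]; }
    }

    // Hypothetical second implementation, so the abstract call site can see
    // more than one receiver type.
    static final class BufferColumnVector extends ColumnVector {
        private final ByteBuffer buf;
        BufferColumnVector(int[] values) {
            buf = ByteBuffer.allocate(4 * values.length);
            for (int i = 0; i < values.length; i++) {
                buf.putInt(4 * i, values[i]);
            }
        }
        @Override int getInt(int rowId) { return buf.getInt(4 * rowId); }
    }

    // Typed against the abstract class: getInt may need virtual dispatch
    // once several implementations reach this call site.
    static long sumAbstract(ColumnVector col, int numRows) {
        long sum = 0;
        for (int i = 0; i < numRows; i++) {
            sum += col.getInt(i);
        }
        return sum;
    }

    // Typed against the final concrete class: the call target is fixed,
    // which makes inlining straightforward for the JIT.
    static long sumConcrete(OnHeapColumnVector col, int numRows) {
        long sum = 0;
        for (int i = 0; i < numRows; i++) {
            sum += col.getInt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4};
        ColumnVector generic = new OnHeapColumnVector(values);
        // The generated code would cast once, outside the row loop.
        OnHeapColumnVector concrete = (OnHeapColumnVector) generic;
        System.out.println(sumAbstract(generic, values.length));
        System.out.println(sumAbstract(new BufferColumnVector(values), values.length));
        System.out.println(sumConcrete(concrete, values.length));
    }
}
```

Both loops compute the same result; the only change is the static type of the column reference, which is the kind of change this issue proposes for the generated scan code.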

            People

            • Assignee: Takuya Ueshin (ueshin)
            • Reporter: Takuya Ueshin (ueshin)
            • Votes: 0
            • Watchers: 5
