  1. IMPALA
  2. IMPALA-9228

ORC scanner could be vectorized

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • Impala 4.0.0, Impala 3.4.0
    • None

    Description

      The ORC scanner uses an external library to read ORC files. The library reads the file contents into its own in-memory representation, a vectorized format similar to Arrow.

      Impala needs to convert the ORC row batch to an Impala row batch. Currently the conversion happens row-wise via virtual function calls:

      https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/hdfs-orc-scanner.cc#L671

      https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L352
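
      To illustrate the row-wise pattern described above, here is a minimal sketch. It is not the Impala code behind the links: ColumnVector, Row, ColumnReader, Int64ColumnReader and ConvertRowWise are hypothetical names, and every column is assumed to hold int64 values with no NULLs. The outer loop walks rows and the inner loop walks columns, so each value costs one virtual call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for one column of a columnar (vectorized) batch.
struct ColumnVector {
  std::vector<int64_t> values;
};

// Hypothetical row layout: one int64 slot per column.
using Row = std::vector<int64_t>;

// Hypothetical reader interface: one virtual call per value.
class ColumnReader {
 public:
  virtual ~ColumnReader() = default;
  virtual void ReadValue(std::size_t row_idx, Row* row) const = 0;
};

class Int64ColumnReader : public ColumnReader {
 public:
  Int64ColumnReader(const ColumnVector* col, std::size_t slot)
      : col_(col), slot_(slot) {}

  void ReadValue(std::size_t row_idx, Row* row) const override {
    assert(row_idx < col_->values.size());
    (*row)[slot_] = col_->values[row_idx];
  }

 private:
  const ColumnVector* col_;  // not owned
  std::size_t slot_;         // destination slot inside the row
};

// Row-wise conversion: num_rows * readers.size() virtual calls in total,
// and each call touches a different source column.
std::vector<Row> ConvertRowWise(
    const std::vector<std::unique_ptr<ColumnReader>>& readers,
    std::size_t num_rows) {
  std::vector<Row> rows(num_rows, Row(readers.size()));
  for (std::size_t r = 0; r < num_rows; ++r) {
    for (const auto& reader : readers) {
      reader->ReadValue(r, &rows[r]);
    }
  }
  return rows;
}
```

      In this sketch a batch of 1024 rows with 4 columns triggers 4096 virtual calls, and consecutive calls jump between the source columns.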

      Instead, the ORC scanner could work like the Parquet scanner, which fills the columns one by one into a scratch batch and then evaluates the conjuncts on the scratch batch. For more details, see HdfsParquetScanner::AssembleRows():

      https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1077-L1088
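
      The column-wise alternative can be sketched as follows. This is an illustration under simplifying assumptions, not the logic of HdfsParquetScanner::AssembleRows(): SourceColumn, ScratchBatch, FillScratchBatch, Conjunct and EvalConjuncts are hypothetical names, all columns are assumed to be non-NULL int64, and conjuncts are modelled as plain std::function predicates. Each column is copied into the scratch batch in one tight, non-virtual loop, and the conjuncts run only after the whole batch is filled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical columnar input: one contiguous value array per column.
struct SourceColumn {
  std::vector<int64_t> values;
};

// Hypothetical scratch batch: fixed-width rows stored back to back.
struct ScratchBatch {
  std::size_t num_cols = 0;
  std::size_t num_rows = 0;
  std::vector<int64_t> slots;  // row-major: slots[row * num_cols + col]

  int64_t Get(std::size_t row, std::size_t col) const {
    return slots[row * num_cols + col];
  }
};

// Column-wise fill: the outer loop walks columns, the inner loop reads one
// source array sequentially and writes with a fixed stride.
ScratchBatch FillScratchBatch(const std::vector<SourceColumn>& cols,
                              std::size_t num_rows) {
  ScratchBatch batch;
  batch.num_cols = cols.size();
  batch.num_rows = num_rows;
  batch.slots.resize(num_rows * cols.size());
  for (std::size_t c = 0; c < cols.size(); ++c) {
    assert(cols[c].values.size() >= num_rows);
    const int64_t* src = cols[c].values.data();
    for (std::size_t r = 0; r < num_rows; ++r) {
      batch.slots[r * batch.num_cols + c] = src[r];
    }
  }
  return batch;
}

// A conjunct is modelled here as a predicate over one row of the batch.
using Conjunct = std::function<bool(const ScratchBatch&, std::size_t)>;

// Evaluates all conjuncts on the filled batch and returns the indices of
// the rows that pass every one of them.
std::vector<std::size_t> EvalConjuncts(const ScratchBatch& batch,
                                       const std::vector<Conjunct>& conjuncts) {
  std::vector<std::size_t> selected;
  for (std::size_t r = 0; r < batch.num_rows; ++r) {
    bool keep = true;
    for (const auto& conjunct : conjuncts) {
      if (!conjunct(batch, r)) {
        keep = false;
        break;
      }
    }
    if (keep) selected.push_back(r);
  }
  return selected;
}
```

      If the per-column copy were placed behind a virtual reader interface, the sketch would need one virtual call per column per batch rather than one per value.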

      This way far fewer virtual function calls are needed, and memory reads and writes become much more localized and predictable.

      Attachments

        1. 1-4_col_measurement_int_only.png
          54 kB
          Gabor Kaszab

          People

            gaborkaszab Gabor Kaszab
            boroknagyz Zoltán Borók-Nagy