Parquet / PARQUET-131

Vectorized Reader In Parquet

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

      Description

      Vectorized query execution could substantially improve performance for SQL engines such as Hive, Drill, and Presto. Instead of processing one row at a time, a vectorized engine processes a batch of rows at a time. Within a batch, each column is represented as a vector of a primitive data type. SQL engines can apply predicates to these vectors very efficiently, rather than pushing a single row through every operator before the next row can be processed.
      Parquet is an efficient columnar data representation, so it would be valuable for it to expose vectorized APIs. All SQL engines could then read vectors directly from Parquet files and run vectorized execution over the Parquet file format.
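
      The batch-at-a-time model described above can be illustrated with a minimal sketch. Every name here (LongColumnVector, RowBatch, selectGreaterThan, sumSelected) is hypothetical and chosen only for illustration; none of them is part of parquet-mr or of the linked proposal. The sketch assumes a batch whose columns are plain long arrays and shows a predicate evaluated over a whole column vector, producing a selection vector of qualifying row indices.

```java
import java.util.Arrays;

/** Hypothetical sketch; none of these types are part of parquet-mr. */
final class VectorizedFilterSketch {

    /** One column of a batch: a primitive array instead of boxed per-row values. */
    static final class LongColumnVector {
        final long[] values;

        LongColumnVector(long[] values) {
            this.values = values;
        }
    }

    /** A batch of rows stored column by column; every column holds at least `size` values. */
    static final class RowBatch {
        final int size;
        final LongColumnVector[] columns;

        RowBatch(int size, LongColumnVector... columns) {
            for (LongColumnVector c : columns) {
                if (c.values.length < size) {
                    throw new IllegalArgumentException("column shorter than batch size");
                }
            }
            this.size = size;
            this.columns = columns;
        }
    }

    /**
     * Applies "value > threshold" to a whole column vector in one tight loop
     * and returns the indices of the qualifying rows (a selection vector).
     */
    static int[] selectGreaterThan(LongColumnVector column, int size, long threshold) {
        int[] selected = new int[size];
        int count = 0;
        for (int i = 0; i < size; i++) {
            if (column.values[i] > threshold) {
                selected[count++] = i;
            }
        }
        return Arrays.copyOf(selected, count);
    }

    /** Sums a second column over only the rows that passed the predicate. */
    static long sumSelected(LongColumnVector column, int[] selected) {
        long sum = 0;
        for (int row : selected) {
            sum += column.values[row];
        }
        return sum;
    }
}
```

      In this sketch, a downstream operator receives the selection vector and touches only the qualifying rows of the other columns, so no individual row has to pass through every operator on its own. A real reader would also need to handle nulls, other primitive types, and dictionary-encoded pages, which this example leaves out.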

      Detailed proposal:
      https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30

        Attachments

        1. ParquetInPresto.pdf (79 kB, Zhenxiao Luo)
        2. Parquet-Vectorized-APIs.pdf (591 kB, Dong Chen)

              People

              • Assignee: Zhenxiao Luo
              • Reporter: Zhenxiao Luo
              • Votes: 1
              • Watchers: 36

                Dates

                • Created:
                  Updated: