HIVE-4160: Vectorized Query Execution in Hive

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The Hive query execution engine currently processes one row at a time: a single row passes through every operator before the next row is processed. This mode of processing uses the CPU inefficiently, and research has shown that it yields very low instructions per cycle [MonetDB X100]. Hive also relies heavily on lazy deserialization, and data columns pass through a layer of object inspectors that identify the column type, deserialize the data, and determine the appropriate expression routines, all inside the inner loop. These layers of virtual method calls slow processing down further.

      This work will add support for vectorized query execution to Hive: instead of individual rows, batches of about a thousand rows are processed at a time. Each column in a batch is represented as a vector of a primitive data type. The inner loop of execution scans these vectors quickly, avoiding method calls, deserialization, and unnecessary if-then-else branching. This substantially reduces CPU time and yields much better instructions per cycle (that is, improved processor pipeline utilization). See the attached design specification for more details.
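
      As a rough illustration of the batch-of-columns idea described above, the following Java sketch stores each column as a primitive array and processes a whole batch in one tight loop. It is not taken from the design specification: the class and method names (VectorizedSketch, LongBatch, addColumns, filterGreaterThan), the batch size of 1024, and the use of a selection-index array for filtering are all assumptions made for the example.

```java
import java.util.Arrays;

// Illustrative sketch only: every name here is hypothetical, not one of Hive's classes.
final class VectorizedSketch {

    // Assumed batch size, standing in for "about a thousand rows".
    static final int BATCH_SIZE = 1024;

    /** A batch of rows stored column-wise: one primitive array per column. */
    static final class LongBatch {
        final long[] colA = new long[BATCH_SIZE];
        final long[] colB = new long[BATCH_SIZE];
        final long[] result = new long[BATCH_SIZE];
        int size; // number of valid rows in this batch
    }

    /** Computes result = colA + colB for every row of the batch in one tight loop. */
    static void addColumns(LongBatch batch) {
        long[] a = batch.colA;
        long[] b = batch.colB;
        long[] out = batch.result;
        int n = batch.size;
        for (int i = 0; i < n; i++) {
            out[i] = a[i] + b[i]; // no per-row method calls or type checks
        }
    }

    /**
     * Finds the rows whose value exceeds the threshold. Writes the indices of
     * the qualifying rows into selected and returns how many qualified.
     * (Tracking survivors by index is an assumption of this sketch.)
     */
    static int filterGreaterThan(long[] column, int n, long threshold, int[] selected) {
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (column[i] > threshold) {
                selected[count++] = i;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        LongBatch batch = new LongBatch();
        batch.size = 4;
        for (int i = 0; i < batch.size; i++) {
            batch.colA[i] = i;        // 0, 1, 2, 3
            batch.colB[i] = 10L * i;  // 0, 10, 20, 30
        }
        addColumns(batch);            // result holds 0, 11, 22, 33
        int[] selected = new int[BATCH_SIZE];
        int kept = filterGreaterThan(batch.result, batch.size, 15, selected);
        System.out.println(Arrays.toString(Arrays.copyOf(batch.result, batch.size)));
        System.out.println(kept + " rows pass the filter");
    }
}
```

      The point of the sketch is that the type of each column is decided once per batch rather than once per row, so the loop body contains only primitive array accesses and arithmetic.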

        Attachments

        1. Hive-Vectorized-Query-Execution-Design.docx
          33 kB
          Jitendra Nath Pandey
        2. Hive-Vectorized-Query-Execution-Design-rev10.docx
          41 kB
          Eric Hanson
        3. Hive-Vectorized-Query-Execution-Design-rev10.docx
          41 kB
          Eric Hanson
        4. Hive-Vectorized-Query-Execution-Design-rev10.pdf
          665 kB
          Eric Hanson
        5. Hive-Vectorized-Query-Execution-Design-rev11.docx
          42 kB
          Eric Hanson
        6. Hive-Vectorized-Query-Execution-Design-rev11.pdf
          671 kB
          Eric Hanson
        7. Hive-Vectorized-Query-Execution-Design-rev2.docx
          31 kB
          Eric Hanson
        8. Hive-Vectorized-Query-Execution-Design-rev3.docx
          32 kB
          Eric Hanson
        9. Hive-Vectorized-Query-Execution-Design-rev3.docx
          32 kB
          Eric Hanson
        10. Hive-Vectorized-Query-Execution-Design-rev3.pdf
          596 kB
          Eric Hanson
        11. Hive-Vectorized-Query-Execution-Design-rev4.docx
          32 kB
          Eric Hanson
        12. Hive-Vectorized-Query-Execution-Design-rev4.pdf
          596 kB
          Eric Hanson
        13. Hive-Vectorized-Query-Execution-Design-rev5.docx
          34 kB
          Eric Hanson
        14. Hive-Vectorized-Query-Execution-Design-rev5.pdf
          609 kB
          Eric Hanson
        15. Hive-Vectorized-Query-Execution-Design-rev6.docx
          34 kB
          Eric Hanson
        16. Hive-Vectorized-Query-Execution-Design-rev6.pdf
          609 kB
          Eric Hanson
        17. Hive-Vectorized-Query-Execution-Design-rev7.docx
          35 kB
          Eric Hanson
        18. Hive-Vectorized-Query-Execution-Design-rev8.docx
          36 kB
          Eric Hanson
        19. Hive-Vectorized-Query-Execution-Design-rev8.pdf
          651 kB
          Eric Hanson
        20. Hive-Vectorized-Query-Execution-Design-rev9.docx
          39 kB
          Sarvesh Sakalanaga
        21. Hive-Vectorized-Query-Execution-Design-rev9.pdf
          657 kB
          Sarvesh Sakalanaga

          Issue Links

          1. Cleanup column type dependencies in vectorization aggregate code (Sub-task, Open, Remus Rusanu)
          2. Support DISTINCT in vectorized aggregates (Sub-task, Open, Remus Rusanu)
          3. Improve cache friendliness of VectorHashKeyWrapper (Sub-task, Open, Remus Rusanu)
          4. Remove unused org.apache.hadoop.hive.ql.exec Writables (Sub-task, Open, Unassigned)
          5. Implement vectorized text reader to read vectorized data from Text file (Sub-task, Patch Available, Sarvesh Sakalanaga)
          6. Support Hive specific DISTRIBUTE BY clause in VectorGroupByOperator (Sub-task, Open, Remus Rusanu)
          7. Optimize COUNT(*) aggregate over vectorized ORC execution path (Sub-task, Open, Unassigned)
          8. Float aggregate of single value loses precision (Sub-task, Open, Remus Rusanu)
          9. Allow prevention of string column re-use for string functions that can set results by reference (Sub-task, Open, Unassigned)
          10. VectorizedRowBatch member variables are public. (Sub-task, Reopened, Jitendra Nath Pandey)
          11. Follow convention for placing modifiers in variable declaration. (Sub-task, Open, Jitendra Nath Pandey)
          12. Avoid catching Throwable and converting them to exceptions. (Sub-task, Open, Jitendra Nath Pandey)
          13. Handle virtual columns and schema evolution in vector code path (Sub-task, Open, Matt McCline)
          14. Supported UDFs should have a separate annotation to indicate they are vectorizable. (Sub-task, Open, Jitendra Nath Pandey)
          15. Implement support for BETWEEN in SELECT list (Sub-task, Patch Available, Navis)
          16. Implement vectorized support for the DECIMAL data type (Sub-task, In Progress, Eric Hanson)
          17. query fails in vectorized mode on empty partitioned table (Sub-task, Open, Unassigned)
          18. Implement fast vectorized InputFormat extension for text files (Sub-task, Open, Eric Hanson)
          19. Extend the alltypesorc test table to include DECIMAL columns (Sub-task, Open, Unassigned)
          20. fix bug in UnsignedInt128.multiplyArrays4And4To8 and revert temporary fix in Decimal128.multiplyDestructive (Sub-task, Open, Jitendra Nath Pandey)
          21. Remove unnecessary white spaces in vectorization code (Sub-task, Patch Available, Teddy Choi)

              People

              • Assignee: Jitendra Nath Pandey (jnp)
              • Reporter: Jitendra Nath Pandey (jnp)
              • Votes: 2
              • Watchers: 56

                  Time Tracking

                  Estimated: 168h
                  Remaining: 168h
                  Logged: Not Specified