HIVE-4160

Vectorized Query Execution in Hive

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The Hive query execution engine currently processes one row at a time: a single row of data goes through all the operators before the next row can be processed. This mode of processing uses the CPU inefficiently; research has shown that it yields very low instructions per cycle [MonetDB X100]. In addition, Hive currently relies heavily on lazy deserialization, and data columns pass through a layer of object inspectors that identify the column type, deserialize the data, and determine the appropriate expression routines in the inner loop. These layers of virtual method calls slow processing further.

      This work will add support for vectorized query execution to Hive, in which batches of about a thousand rows are processed at a time instead of individual rows. Each column in the batch is represented as a vector of a primitive data type. The inner loop of execution scans these vectors very fast, avoiding method calls, deserialization, unnecessary if-then-else branches, and so on. This substantially reduces the CPU time used and gives excellent instructions per cycle (i.e. improved processor pipeline utilization). See the attached design specification for more details.
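
A minimal sketch of the batch layout and inner loop described above, in Java. The names `VectorizedAddSketch`, `RowBatch`, `LongColumn`, `addScalar` and the batch size of 1024 are assumptions made for illustration; they are not taken from the attached design specification or from Hive's code.

```java
// Hypothetical, simplified types for illustration only; not Hive's actual classes.
final class VectorizedAddSketch {
    // Assumed batch size, standing in for "about a thousand rows".
    static final int BATCH_SIZE = 1024;

    /** One column of a batch, stored as an array of a primitive type. */
    static final class LongColumn {
        final long[] vector = new long[BATCH_SIZE];
    }

    /** A batch of rows: a set of column vectors plus the number of valid rows. */
    static final class RowBatch {
        final LongColumn[] cols;
        int size; // number of rows currently held in the batch

        RowBatch(int numCols) {
            cols = new LongColumn[numCols];
            for (int i = 0; i < numCols; i++) {
                cols[i] = new LongColumn();
            }
        }
    }

    /** Column-scalar addition: out[i] = in[i] + scalar for every row in the batch. */
    static void addScalar(RowBatch batch, int inCol, long scalar, int outCol) {
        long[] in = batch.cols[inCol].vector;
        long[] out = batch.cols[outCol].vector;
        int n = batch.size;
        // Tight loop over primitive arrays: no per-row method calls,
        // no deserialization, and no type dispatch inside the loop.
        for (int i = 0; i < n; i++) {
            out[i] = in[i] + scalar;
        }
    }
}
```

The per-row work is a single array read, an add and an array write, which is the kind of loop the description expects to run with good processor pipeline utilization.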

      1. Hive-Vectorized-Query-Execution-Design-rev11.pdf
        671 kB
        Eric Hanson
      2. Hive-Vectorized-Query-Execution-Design-rev11.docx
        42 kB
        Eric Hanson
      3. Hive-Vectorized-Query-Execution-Design-rev10.docx
        41 kB
        Eric Hanson
      4. Hive-Vectorized-Query-Execution-Design-rev10.pdf
        665 kB
        Eric Hanson
      5. Hive-Vectorized-Query-Execution-Design-rev10.docx
        41 kB
        Eric Hanson
      6. Hive-Vectorized-Query-Execution-Design-rev9.pdf
        657 kB
        Sarvesh Sakalanaga
      7. Hive-Vectorized-Query-Execution-Design-rev9.docx
        39 kB
        Sarvesh Sakalanaga
      8. Hive-Vectorized-Query-Execution-Design-rev8.pdf
        651 kB
        Eric Hanson
      9. Hive-Vectorized-Query-Execution-Design-rev8.docx
        36 kB
        Eric Hanson
      10. Hive-Vectorized-Query-Execution-Design-rev7.docx
        35 kB
        Eric Hanson
      11. Hive-Vectorized-Query-Execution-Design-rev6.pdf
        609 kB
        Eric Hanson
      12. Hive-Vectorized-Query-Execution-Design-rev6.docx
        34 kB
        Eric Hanson
      13. Hive-Vectorized-Query-Execution-Design-rev5.pdf
        609 kB
        Eric Hanson
      14. Hive-Vectorized-Query-Execution-Design-rev5.docx
        34 kB
        Eric Hanson
      15. Hive-Vectorized-Query-Execution-Design-rev4.pdf
        596 kB
        Eric Hanson
      16. Hive-Vectorized-Query-Execution-Design-rev4.docx
        32 kB
        Eric Hanson
      17. Hive-Vectorized-Query-Execution-Design-rev3.docx
        32 kB
        Eric Hanson
      18. Hive-Vectorized-Query-Execution-Design-rev3.pdf
        596 kB
        Eric Hanson
      19. Hive-Vectorized-Query-Execution-Design-rev3.docx
        32 kB
        Eric Hanson
      20. Hive-Vectorized-Query-Execution-Design-rev2.docx
        31 kB
        Eric Hanson
      21. Hive-Vectorized-Query-Execution-Design.docx
        33 kB
        Jitendra Nath Pandey

        Issue Links

        1.
        Implement vectorized logical expressions. Sub-task Resolved Jitendra Nath Pandey  
         
        2.
        Implement vectorized column-scalar expressions Sub-task Resolved Jitendra Nath Pandey  
         
        3.
        Implement class for vectorized row batch Sub-task Resolved Eric Hanson  
         
        4.
        Implement classes for column vectors. Sub-task Resolved Eric Hanson  
         
        5.
        Change ORC tree readers to return batches of rows instead of a row Sub-task Resolved Sarvesh Sakalanaga  
         
        6.
        Implement Vectorized Column-Column expressions Sub-task Resolved Jitendra Nath Pandey  
         
        7.
        Implement Vectorized Scalar-Column expressions Sub-task Resolved Eric Hanson  
         
        8.
        Implement vectorized aggregation expressions Sub-task Resolved Remus Rusanu  
         
        9.
        Implement vectorized string column-scalar filters Sub-task Resolved Eric Hanson  
         
        10.
        Implement vectorized string functions UPPER(), LOWER(), LENGTH() Sub-task Resolved Eric Hanson  
         
        11.
        Implement vectorized LIKE filter Sub-task Resolved Eric Hanson  
         
        12.
        Vectorized filter and select operators Sub-task Resolved Jitendra Nath Pandey  
         
        13.
        Generate vectorized execution plan Sub-task Resolved Jitendra Nath Pandey  
         
        14.
        Vectorized expression for unary minus. Sub-task Resolved Jitendra Nath Pandey  
         
        15.
        Implement vectorized string concatenation Sub-task Resolved Eric Hanson  
         
        16.
        Extend Vector Aggregates to support GROUP BY Sub-task Resolved Remus Rusanu  
         
        17.
        Add support for string column type vector aggregates: COUNT, MIN and MAX Sub-task Resolved Remus Rusanu  
         
        18.
        Add support for COUNT(*) in vector aggregates Sub-task Resolved Remus Rusanu  
         
        19.
        Input format to read vector data from ORC Sub-task Resolved Jitendra Nath Pandey  
         
        20.
        Support partitioned tables in vectorized query execution. Sub-task Resolved Jitendra Nath Pandey  
         
        21.
        Queries not supported by vectorized code path should fall back to non vector path. Sub-task Resolved Jitendra Nath Pandey  
         
        22.
        set isRepeating to false by default in ColumnArithmeticColumn.txt Sub-task Resolved Eric Hanson  
         
        23.
        Finish support for modulo (%) operator for vectorized arithmetic Sub-task Resolved Eric Hanson  
         
        24.
        Add unit tests for vectorized IS NULL and IS NOT NULL filters Sub-task Resolved Jitendra Nath Pandey  
         
        25.
        Extend plan vectorization to cover GroupByOperator Sub-task Resolved Remus Rusanu  
         
        26.
        OR, NOT Filter logic can lose an array, and always takes time O(VectorizedRowBatch.DEFAULT_SIZE) Sub-task Resolved Jitendra Nath Pandey  
         
        27.
        Improvement in logical expressions and checkstyle fixes. Sub-task Resolved Jitendra Nath Pandey  
         
        28.
        remove redundant copy of arithmetic filter unit test testColOpScalarNumericFilterNullAndRepeatingLogic Sub-task Resolved Eric Hanson  
         
        29.
        In ORC, add boolean noNulls flag to column stripe metadata Sub-task Closed Prasanth Jayachandran  
         
        30.
        Child expressions are not being evaluated hierarchically in a few templates. Sub-task Resolved Jitendra Nath Pandey  
         
        31.
        Implement partition support for vectorized query execution Sub-task Resolved Sarvesh Sakalanaga  
         
        32.
        Vectorized row batch should be initialized with additional columns to hold intermediate output. Sub-task Resolved Jitendra Nath Pandey  
         
        33.
        Template file VectorUDAFAvg.txt missing from public branch; CodeGen.java fails Sub-task Resolved Remus Rusanu  
         
        34.
        Input format to read vector data from RC file Sub-task Resolved Sarvesh Sakalanaga  
         
        35.
        Implement vectorized filter for string column compared to string column Sub-task Resolved Eric Hanson  
         
        36.
        Implement vectorized string substr Sub-task Resolved Timothy Chen  
         
        37.
        Integer division should be cast to double. Sub-task Resolved Jitendra Nath Pandey  
         
        38.
        Vectorized reader support for Byte Boolean and Timestamp. Sub-task Resolved Sarvesh Sakalanaga  
         
        39.
        The vectorized plan is not picking right expression class for string concatenation. Sub-task Resolved Eric Hanson  
         
        40.
        Handle constants in projection Sub-task Resolved Jitendra Nath Pandey  
         
        41.
        Add partition support for vectorized ORC Input format Sub-task Resolved Sarvesh Sakalanaga  
         
        42.
        vectorized NotCol operation does not handle short-circuit evaluation for NULL propagation correctly Sub-task Resolved Jitendra Nath Pandey  
         
        43.
        IsNotNull and NotCol incorrectly handle nulls. Sub-task Resolved Jitendra Nath Pandey  
         
        44.
        select * fails on orc table when vectorization is enabled Sub-task Resolved Sarvesh Sakalanaga  
         
        45.
        only explicit int type works e2e. tiny,small, and big all fail with: org.apache.hadoop.hive.ql.metadata.HiveException: Unsuported JIT vectorization column type Sub-task Resolved Tony Murphy  
         
        46.
        Move test utils and fix build to remove false test failures Sub-task Resolved Tony Murphy  
         
        47.
        Run check-style on the branch and fix style issues. Sub-task Resolved Jitendra Nath Pandey  
         
        48.
        VectorizedRowBatchCtx::CreateVectorizedRowBatch should create only the projected columns and not all columns Sub-task Resolved Sarvesh Sakalanaga  
         
        49.
        Speed up vectorized LIKE filter for special cases abc%, %abc and %abc% Sub-task Resolved Teddy Choi  
         
        50.
        Vectorized RecordReader for ORC does not set the ColumnVector.IsRepeating correctly Sub-task Resolved Sarvesh Sakalanaga  
         
        51.
        Column Column, and Column Scalar vectorized execution tests Sub-task Resolved Tony Murphy  
         
        52.
        In place filtering in Not Filter doesn't handle nulls correctly. Sub-task Resolved Jitendra Nath Pandey  
         
        53.
        fix failure to set output isNull to true and other NULL propagation issues; update arithmetic tests Sub-task Resolved Eric Hanson  
         
        54.
        Support strings in GROUP BY keys Sub-task Resolved Remus Rusanu  
         
        55.
        Fix serialization exceptions in VectorGroupByOperator Sub-task Resolved Remus Rusanu  
         
        56.
        Remove test code from ql\src\java tree, place it in ql\src\test tree Sub-task Resolved Tony Murphy  
         
        57.
        VectorGroupByOperator steals the non-vectorized children and crashes query if vectorization fails Sub-task Resolved Jitendra Nath Pandey  
         
        58.
        Vectorized reader support for timestamp in ORC. Sub-task Resolved Sarvesh Sakalanaga  
         
        59.
        Enable running all hive e2e tests under vectorization Sub-task Resolved Tony Murphy  
         
        60.
        VectorSelectOperator projections change the index of columns for subsequent operators. Sub-task Resolved Jitendra Nath Pandey  
         
        61. Cleanup column type dependencies in vectorization aggregate code Sub-task Open Remus Rusanu  
         
        62.
        Implement vector group by hash spill Sub-task Resolved Remus Rusanu  
         
        63. Support DISTINCT in vectorized aggregates Sub-task Open Remus Rusanu  
         
        64.
        Vectorized UDFs for Timestamp in nanoseconds Sub-task Resolved Gopal V  
         
        65.
        Vectorized aggregates do not emit proper rows in presence of GROUP BY Sub-task Resolved Remus Rusanu  
         
        66. Improve cache friendliness of VectorHashKeyWrapper Sub-task Open Remus Rusanu  
         
        67.
        Integrate Vectorized Substr into Vectorized QE Sub-task Resolved Eric Hanson  
         
        68.
        Fix VectorUDAFSum.txt to honor the expected vector column type Sub-task Resolved Remus Rusanu  
         
        69.
        CommonOrcInputFormat should be the default input format for Orc tables. Sub-task Resolved Sarvesh Sakalanaga  
         
        70.
        Implement vectorized RLIKE and REGEXP filter expressions Sub-task Resolved Teddy Choi  
         
        71.
        Unit test failure in TestColumnScalarOperationVectorExpressionEvaluation Sub-task Resolved Jitendra Nath Pandey  
         
        72.
        TestVectorGroupByOperator causes asserts in StandardStructObjectInspector.init Sub-task Resolved Remus Rusanu  
         
        73.
        VectorHashKeyWrapperBatch.java should be in vector package (instead of exec) Sub-task Resolved Remus Rusanu  
         
        74.
        Favor serde2.io Writable classes over hadoop.io ones Sub-task Resolved Remus Rusanu  
         
        75. Remove unused org.apache.hadoop.hive.ql.exec Writables Sub-task Open Unassigned  
         
        76.
        Vectorization not working with negative constants, hive doesn't fold constants. Sub-task Resolved Jitendra Nath Pandey  
         
        77. Implement vectorized text reader to read vectorized data from Text file Sub-task Patch Available Sarvesh Sakalanaga  
         
        78. Support Hive specific DISTRIBUTE BY clause in VectorGroupByOperator Sub-task Open Remus Rusanu  
         
        79.
        error at VectorExecMapper.close in group-by-agg query over ORC, vectorized Sub-task Resolved Jitendra Nath Pandey  
         
        80.
        Count(*) over tpch lineitem ORC results in Error: Java heap space Sub-task Resolved Sarvesh Sakalanaga  
         
        81.
        tpch query 1 fails with java.lang.ClassCastException Sub-task Resolved Jitendra Nath Pandey  
         
        82.
        wrong results for query with modulo (%) in WHERE clause filter Sub-task Resolved Sarvesh Sakalanaga  
         
        83.
        Use VectorExpressionWriter to write column vectors into Writables. Sub-task Resolved Jitendra Nath Pandey  
         
        84. Optimize COUNT(*) aggregate over vectorized ORC execution path Sub-task Open Unassigned  
         
        85.
        second clause of AND, OR filter not applied for vectorized execution Sub-task Resolved Jitendra Nath Pandey  
         
        86.
        second clause of OR filter not applied in vectorized query execution Sub-task Resolved Jitendra Nath Pandey  
         
        87.
        Fix ORC TimestampTreeReader.nextVector() to handle milli-nano math correctly Sub-task Resolved Gopal V  
         
        88.
        Query with filter constant on left of "=" and column expression on right does not vectorize Sub-task Resolved Jitendra Nath Pandey  
         
        89.
        query using LIKE does not vectorize Sub-task Resolved Eric Hanson  
         
        90.
        Max on float returning wrong results Sub-task Resolved Remus Rusanu  
         
        91.
        incorrect result for max aggregate over int column Sub-task Resolved Remus Rusanu  
         
        92.
        NPE in writing null values. Sub-task Resolved Jitendra Nath Pandey  
         
        93.
        Unit test failure in TestColumnColumnOperationVectorExpressionEvaluation Sub-task Resolved Eric Hanson  
         
        94.
        Fix ORC TestVectorizedORCReader testcase for Timestamps Sub-task Resolved Gopal V  
         
        95.
        Integrate basic UDFs for Timestamp Sub-task Resolved Gopal V  
         
        96.
        Optimize filter Column IN ( list-of-constants ) for vectorized execution Sub-task Resolved Unassigned  
         
        97.
        Unit test failure TestVectorSelectOperator Sub-task Resolved Jitendra Nath Pandey  
         
        98.
        TestCase FakeVectorRowBatchFromObjectIterables error Sub-task Resolved Eric Hanson  
         
        99.
        Query on Table with partition columns fail with AlreadyBeingCreatedException Sub-task Resolved Sarvesh Sakalanaga  
         
        100.
        Vectorized Sum of scalar subtract column returns negative result when positive expected Sub-task Resolved Jitendra Nath Pandey  
         
        101.
        Classcast exception with two group by keys of types string and tinyint. Sub-task Resolved Remus Rusanu  
         
        102.
        array out of bounds exception near VectorHashKeyWrapper.getBytes() with 2 column GROUP BY Sub-task Resolved Remus Rusanu  
         
        103.
        MIN on timestamp column gives incorrect result. Sub-task Resolved Gopal V  
         
        104.
        Optimize ORC StringTreeReader::nextVector to not create dictionary of strings for each call to nextVector Sub-task Resolved Sarvesh Sakalanaga  
         
        105. Float aggregate of single value loses precision Sub-task Open Remus Rusanu  
         
        106.
        Unary Minus Expression Throwing java.lang.NullPointerException Sub-task Resolved Jitendra Nath Pandey  
         
        107.
        java.lang.RuntimeException: Hive Runtime Error while closing operators: java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DoubleWritable Sub-task Resolved Jitendra Nath Pandey  
         
        108.
        OrcInputFormat should be enhanced to provide vectorized input. Sub-task Resolved Jitendra Nath Pandey  
         
        109.
        NULLs and record separators broken with vectorization branch intermediate outputs Sub-task Resolved Gopal V  
         
        110.
        Vectorized ORC reader does not handle absence of column present stream correctly. Sub-task Resolved Sarvesh Sakalanaga  
         
        111.
        Null Pointer Exception in Group By Operator Sub-task Resolved Jitendra Nath Pandey  
         
        112.
        Hive Runtime Error while closing operators: java.lang.NullPointerException Sub-task Resolved Remus Rusanu  
         
        113.
        Incorrect aggregate results Sub-task Resolved Remus Rusanu  
         
        114.
        make vectorized LOWER(), UPPER(), LENGTH() work end-to-end; support expression input for vectorized LIKE Sub-task Resolved Eric Hanson  
         
        115.
        Unit e2e tests for vectorization Sub-task Resolved Tony Murphy  
         
        116.
        Implement vectorized type casting for all types Sub-task Resolved Eric Hanson  
         
        117.
        implement vectorized math functions Sub-task Resolved Eric Hanson  
         
        118.
        implement vectorized TRIM(), LTRIM(), RTRIM() Sub-task Resolved Eric Hanson  
         
        119.
        Make vectorization branch compile under JDK 7 Sub-task Resolved Ashutosh Chauhan  
         
        120.
        Implement Vectorized Limit Operator Sub-task Resolved Sarvesh Sakalanaga  
         
        121.
        std, stddev and stddev_pop aggregates on double/float fail to vectorize Sub-task Resolved Remus Rusanu  
         
        122.
        Implement vectorized JOIN operators Sub-task Resolved Remus Rusanu  
         
        123.
        String column comparison classes should be renamed. Sub-task Resolved Jitendra Nath Pandey  
         
        124.
        ORC TimestampTreeReader.nextVector() off by a second when time in fractional Sub-task Resolved Gopal V  
         
        125.
        make vectorized math functions work end-to-end (update VectorizationContext.java) Sub-task Resolved Eric Hanson  
         
        126.
        Vectorized ORC reader does not set isRepeating flag correctly when 1’s are present in the input stream Sub-task Resolved Sarvesh Sakalanaga  
         
        127.
        create template for string scalar compared with string column Sub-task Resolved Eric Hanson  
         
        128.
        MAX/MIN aggregates yield incorrect results Sub-task Resolved Remus Rusanu  
         
        129.
        Make RLIKE/REGEXP run end-to-end by updating VectorizationContext Sub-task Resolved Teddy Choi  
         
        130. Allow prevention of string column re-use for string functions that can set results by reference Sub-task Open Unassigned  
         
        131.
        Vectorized plan generation should be added as an optimization transform. Sub-task Resolved Jitendra Nath Pandey  
         
        132.
        Create bridge for custom UDFs to operate in vectorized mode Sub-task Resolved Eric Hanson  
         
        133.
        Unit test failure in TestVectorTimestampExpressions Sub-task Resolved Gopal V  
         
        134.
        Consolidate and simplify vectorization code and test generation Sub-task Resolved Tony Murphy  
         
        135.
        Make vector expressions serializable. Sub-task Resolved Jitendra Nath Pandey  
         
        136.
        FilterExprOrExpr changes the order of the rows Sub-task Resolved Jitendra Nath Pandey  
         
        137.
        Vector operators should inherit from non-vector operators for code re-use. Sub-task Resolved Jitendra Nath Pandey  
         
        138.
        Enhance explain to indicate vectorized execution of operators. Sub-task Resolved Jitendra Nath Pandey  
         
        139.
        orc_create.q and other orc tests fail on the branch. Sub-task Resolved Jitendra Nath Pandey  
         
        140.
        The code generation should be part of the build process. Sub-task Resolved Jitendra Nath Pandey  
         
        141.
        Update hive-default.xml.template for vectorization flag; remove unused imports from MetaStoreUtils.java Sub-task Resolved Jitendra Nath Pandey  
         
        142.
        Commit vectorization test data, comment/rename vectorization tests. Sub-task Resolved Tony Murphy  
         
        143.
        Boolean constants in the query are not handled correctly. Sub-task Resolved Jitendra Nath Pandey  
         
        144. VectorizedRowBatch member variables are public. Sub-task Reopened Jitendra Nath Pandey  
         
        145. Follow convention for placing modifiers in variable declaration. Sub-task Open Jitendra Nath Pandey  
         
        146. Avoid catching Throwable and converting them to exceptions. Sub-task Open Jitendra Nath Pandey  
         
        147.
        Refactor VectorizationContext and handle NOT expression with nulls. Sub-task Resolved Jitendra Nath Pandey  
         
        148.
        Vectorization throws exception with nested UDF. Sub-task Resolved Jitendra Nath Pandey  
         
        149.
        TopN optimization in VectorReduceSink Sub-task Resolved Sergey Shelukhin  
         
        150.
        Implement end-to-end tests for vectorized string and math functions, and casts Sub-task Resolved Eric Hanson  
         
        151.
        Vectorized query failing for partitioned tables. Sub-task Resolved Jitendra Nath Pandey  
         
        152. Handle virtual columns and schema evolution in vector code path Sub-task Open Matt McCline  
         
        153.
        Implement vectorized year/month/day... etc. for string arguments Sub-task Resolved Teddy Choi  
         
        154.
        Implement BETWEEN filter in vectorized mode Sub-task Resolved Eric Hanson  
         
        155.
        Implement support for IN (list-of-constants) filter in vectorized mode Sub-task Resolved Eric Hanson  
         
        156.
        Write initial user documentation for vectorized query on Hive Wiki Sub-task Resolved Eric Hanson  
         
        157.
        Exception in vectorized map join. Sub-task Resolved Jitendra Nath Pandey  
         
        158.
        Implement vectorized SMB JOIN Sub-task Resolved Remus Rusanu

         
        159.
        Fix validation of nested expressions. Sub-task Resolved Jitendra Nath Pandey  
         
        160.
        Exception in UDFs with large number of arguments. Sub-task Resolved Jitendra Nath Pandey  
         
        161.
        Vectorized Shuffle Join produces incorrect results Sub-task Resolved Remus Rusanu  
         
        162. Supported UDFs should have a separate annotation to indicate they are vectorizable. Sub-task Open Jitendra Nath Pandey  
         
        163.
        Validation doesn't catch SMBMapJoin Sub-task Resolved Jitendra Nath Pandey  
         
        164.
        Intermediate columns are incorrectly initialized for partitioned tables. Sub-task Resolved Jitendra Nath Pandey  
         
        165.
        Add unit test for vectorized BETWEEN for timestamp inputs Sub-task Resolved Eric Hanson  
         
        166. Implement support for BETWEEN in SELECT list Sub-task Patch Available Navis  
         
        167.
        Implement vectorization support for IF conditional expression for long, double, timestamp, boolean and string inputs Sub-task Resolved Eric Hanson  
         
        168.
        Implement vectorized support for CASE Sub-task Resolved Eric Hanson  
         
        169.
        Implement vectorized support for NOT IN filter Sub-task Resolved Eric Hanson  
         
        170.
        Implement vectorized support for COALESCE conditional expression Sub-task Resolved Jitendra Nath Pandey  
         
        171.
        Implement vectorized support for the DATE data type Sub-task Resolved Teddy Choi  
         
        172. Implement vectorized support for the DECIMAL data type Sub-task In Progress Eric Hanson  
         
        173.
        Implement vectorization support for IF conditional expression for boolean and timestamp inputs Sub-task Resolved Eric Hanson  
         
        174.
        Implement vectorization support for IF conditional expression for string inputs Sub-task Resolved Eric Hanson  
         
        175. query fails in vectorized mode on empty partitioned table Sub-task Open Unassigned  
         
        176.
        Implement vectorized support for IN as boolean-valued expression Sub-task Resolved Eric Hanson  
         
        177.
        Implement vectorized support for CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END Sub-task Resolved Unassigned  
         
        178.
        Rollups not supported in vector mode. Sub-task Resolved Jitendra Nath Pandey  
         
        179.
        Failure in cast to timestamps. Sub-task Resolved Jitendra Nath Pandey  
         
        180.
        Add vectorized reader for Parquet files Sub-task Closed Remus Rusanu  
         
        181.
        Contribute Decimal128 high-performance decimal(p, s) package from Microsoft to Hive Sub-task Resolved Eric Hanson  
         
        182.
        Create DecimalColumnVector and a representative VectorExpression for decimal Sub-task Resolved Eric Hanson  
         
        183.
        Implement vectorized decimal comparison filters Sub-task Resolved Eric Hanson  
         
        184.
        Support basic Decimal arithmetic in vector mode (+, -, *) Sub-task Resolved Eric Hanson  
         
        185.
        Implement vectorized decimal division and modulo Sub-task Resolved Eric Hanson  
         
        186.
        Implement vectorized reader for Date datatype for ORC format. Sub-task Resolved Jitendra Nath Pandey  
         
        187.
        Implement vectorized reader for DECIMAL datatype for ORC format. Sub-task Resolved Jitendra Nath Pandey  
         
        188.
        Implement vectorized type cast from/to decimal(p, s) Sub-task Resolved Eric Hanson  
         
        189.
        error in vectorized Column-Column comparison filter for repeating case Sub-task Resolved Eric Hanson  
         
        190.
        Make Vector Group By operator abandon grouping if too many distinct keys Sub-task Resolved Remus Rusanu  
         
        191. Implement fast vectorized InputFormat extension for text files Sub-task Open Eric Hanson  
         
        192.
        error in high-precision division for Decimal128 Sub-task Resolved Eric Hanson  
         
        193.
        Add more unit tests for high-precision Decimal128 arithmetic Sub-task Resolved Eric Hanson  
         
        194.
        VectorExpressionWriter for date and decimal datatypes. Sub-task Resolved Jitendra Nath Pandey  
         
        195.
        Generate vectorized plan for decimal expressions. Sub-task Resolved Jitendra Nath Pandey  
         
        196.
        Add DECIMAL support to vectorized group by operator Sub-task Resolved Remus Rusanu  
         
        197.
        Add DECIMAL support to vectorized JOIN operators Sub-task Resolved Remus Rusanu  
         
        198.
        Column name map is broken Sub-task Resolved Jitendra Nath Pandey  
         
        199. Extend the alltypesorc test table to include DECIMAL columns Sub-task Open Unassigned  
         
        200.
        Implement vectorized unary minus for decimal Sub-task Resolved Jitendra Nath Pandey  
         
        201.
        bug in high-precision Decimal128 multiply Sub-task Resolved Eric Hanson  
         
        202.
        Vectorized mathematical functions for decimal type. Sub-task Resolved Jitendra Nath Pandey  
         
        203. fix bug in UnsignedInt128.multiplyArrays4And4To8 and revert temporary fix in Decimal128.multiplyDestructive Sub-task Open Jitendra Nath Pandey  
         
        204.
        Queries fail to Vectorize. Sub-task Resolved Jitendra Nath Pandey  
         

          Activity

          Jitendra Nath Pandey added a comment -

          This will be incremental work in multiple phases, with no regression on the current system. We will publish a design/scope document very soon.
          The main idea behind the proposal is to transform the execution engine to process a row batch at a time instead of a single row. The row batch will consist of column vectors, and each operator will process a whole column vector at a time. The column vector will consist of array(s) of primitive types as far as possible.
          The expressions will be implemented for various data types using pre-compiled templates. The appropriate expressions will be added to the operators based on data types.
          A vectorized iterator interface will be implemented by the file formats to provide vectorized input to the operator tree.
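
One way to picture the per-type expressions mentioned in this comment is the Java sketch below. It assumes each data type gets its own expression class (written by hand here, where the proposal would generate them from templates) and that the planner picks the class from the column type. Every name in it (`TypedExpressionSketch`, `Batch`, `VectorExpression`, `LongColAddLongScalar`, `DoubleColAddDoubleScalar`, `addScalar`) is hypothetical and chosen for illustration.

```java
// Hypothetical sketch: one expression class per data type, selected by column type.
final class TypedExpressionSketch {
    /** A batch whose columns are primitive arrays (long[] or double[] in this sketch). */
    static final class Batch {
        final Object[] cols;
        final int size; // number of valid rows

        Batch(int size, Object... cols) {
            this.size = size;
            this.cols = cols;
        }
    }

    /** An expression evaluates over a whole batch at a time. */
    interface VectorExpression {
        void evaluate(Batch batch);
    }

    /** The long instantiation of an assumed "column + scalar" template. */
    static final class LongColAddLongScalar implements VectorExpression {
        private final int inCol;
        private final long scalar;
        private final int outCol;

        LongColAddLongScalar(int inCol, long scalar, int outCol) {
            this.inCol = inCol;
            this.scalar = scalar;
            this.outCol = outCol;
        }

        @Override
        public void evaluate(Batch batch) {
            long[] in = (long[]) batch.cols[inCol];
            long[] out = (long[]) batch.cols[outCol];
            for (int i = 0; i < batch.size; i++) {
                out[i] = in[i] + scalar;
            }
        }
    }

    /** The double instantiation of the same assumed template. */
    static final class DoubleColAddDoubleScalar implements VectorExpression {
        private final int inCol;
        private final double scalar;
        private final int outCol;

        DoubleColAddDoubleScalar(int inCol, double scalar, int outCol) {
            this.inCol = inCol;
            this.scalar = scalar;
            this.outCol = outCol;
        }

        @Override
        public void evaluate(Batch batch) {
            double[] in = (double[]) batch.cols[inCol];
            double[] out = (double[]) batch.cols[outCol];
            for (int i = 0; i < batch.size; i++) {
                out[i] = in[i] + scalar;
            }
        }
    }

    /** Chooses the expression class from the column type, once, at plan time. */
    static VectorExpression addScalar(String columnType, int inCol, Number scalar, int outCol) {
        switch (columnType) {
            case "long":
                return new LongColAddLongScalar(inCol, scalar.longValue(), outCol);
            case "double":
                return new DoubleColAddDoubleScalar(inCol, scalar.doubleValue(), outCol);
            default:
                throw new IllegalArgumentException("no vectorized expression for type " + columnType);
        }
    }
}
```

Because the type decision is made once when the expression is chosen, the loop inside `evaluate` carries no per-row type checks.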

          Jitendra Nath Pandey added a comment -

          Reference on MonetDB: http://www-db.cs.wisc.edu/cidr/cidr2005/papers/P19.pdf

          Eric Hanson added a comment -

          This is part of the Stinger initiative. http://hortonworks.com/blog/100x-faster-hive/

          Jitendra Nath Pandey added a comment -

          The attached document covers the outline of the design. Any comments/feedback are welcome. We will keep updating the document with more details as we include more data types, operators and expressions. We will also include the vectorized iterator design into the document.

          Eric Hanson added a comment -

          Added section on requirements for implementation of vectorized iterator, with respect to how to load VectorizedRowBatch object on each call to next().

          Steve Loughran added a comment -

          We couldn't have a copy of the doc in PDF stuck up at the same time as the editable one could we?

          Eric Hanson added a comment -

          Fixed a bug in example, plus made minor wording changes in introduction.

          Eric Hanson added a comment -

          Adding pdf of design doc per request.

          Eric Hanson added a comment -

          updated version # and date

          Eric Hanson added a comment -

          Updated design document with discussion of precise handling and interpretation of all-non-null (noNulls) and all identical (isRepeating) column vectors.

          Also included discussion of the TIMESTAMP internal vector representation as a long integer number of nanoseconds since the epoch.
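
As an illustration of how the noNulls and isRepeating flags might be interpreted by an inner loop, here is a hedged sketch in Java. The class `LongColumn` and the method `sum` are hypothetical, and the interpretation shown is an assumption made for this example: when isRepeating is set, only entry 0 is meaningful, and when noNulls is set, the isNull array is not consulted. The design document remains the authority on the precise rules.

```java
/** Illustrative handling of noNulls / isRepeating in a column vector (hypothetical names). */
final class ColumnFlagsSketch {

    static final class LongColumn {
        final long[] values;
        final boolean[] isNull;
        boolean noNulls = true;       // assumption: true means isNull need not be read
        boolean isRepeating = false;  // assumption: true means entry 0 stands for every row

        LongColumn(int capacity) {
            values = new long[capacity];
            isNull = new boolean[capacity];
        }
    }

    /** Sums the non-null values among the first n rows of the column. */
    static long sum(LongColumn col, int n) {
        if (col.isRepeating) {
            // All rows share entry 0, so the loop collapses to a single multiplication.
            boolean allNull = !col.noNulls && col.isNull[0];
            return allNull ? 0L : col.values[0] * n;
        }
        long total = 0;
        if (col.noNulls) {
            // Fast path: no null checks inside the loop.
            for (int i = 0; i < n; i++) {
                total += col.values[i];
            }
        } else {
            for (int i = 0; i < n; i++) {
                if (!col.isNull[i]) {
                    total += col.values[i];
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        LongColumn col = new LongColumn(4);
        col.values[0] = 7;
        col.isRepeating = true;
        System.out.println(sum(col, 4)); // prints 28
    }
}
```

The point of checking the flags once, outside the loop, is that each loop body stays free of branches it does not need.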

          Eric Hanson added a comment -

          The code for this work is currently in the "vectorization" branch of the public Hive repo.

          Eric Hanson added a comment -

          Added discussion of timestamp values before the epoch (in 1970) related to HIVE-4525.
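
One reason pre-epoch timestamps deserve separate discussion is that a long count of nanoseconds since the epoch becomes negative before 1970, and Java's `/` and `%` operators truncate toward zero. The sketch below is an illustration of that general pitfall, not a description of what Hive does; the class and method names are hypothetical. It uses `Math.floorDiv` and `Math.floorMod` so that the fractional part is never negative.

```java
import java.time.Instant;

/** Splitting a signed nanoseconds-since-epoch value into seconds and nanos (hypothetical names). */
final class EpochNanosSketch {

    static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Whole seconds since the epoch, rounded toward negative infinity. */
    static long wholeSeconds(long epochNanos) {
        return Math.floorDiv(epochNanos, NANOS_PER_SECOND);
    }

    /** Nanoseconds within the second, always in the range [0, 999_999_999]. */
    static long nanosOfSecond(long epochNanos) {
        return Math.floorMod(epochNanos, NANOS_PER_SECOND);
    }

    static Instant toInstant(long epochNanos) {
        return Instant.ofEpochSecond(wholeSeconds(epochNanos), nanosOfSecond(epochNanos));
    }

    public static void main(String[] args) {
        long t = -1_500_000_000L; // 1.5 seconds before the epoch

        // Truncating operators give -1 and -500000000: a negative fractional part.
        System.out.println((t / NANOS_PER_SECOND) + " " + (t % NANOS_PER_SECOND));

        // Floor-based helpers give -2 and 500000000.
        System.out.println(wholeSeconds(t) + " " + nanosOfSecond(t));

        System.out.println(toInstant(t));
    }
}
```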

          Eric Hanson added a comment -

          Updated design spec with new section by Remus Rusanu about vectorized group-by/aggregate. I edited it a little bit and added the final paragraph on future considerations.

          Dmitriy V. Ryaboy added a comment -

          Hi folks,
          What an incredible amount of work! Looks fantastic, looking forward to this.

          It seems like the general idea of a vectorized operator is not Hive-specific. Is there any possibility of abstracting the core logic of an operator that can efficiently process a stream of data, such as what you get from ORCFile, and return the computed results?

          Having such a library be available independently of Hive would allow reuse in other Hadoop ecosystem projects (Pig, Cascading, Drill, etc) without the need to reinvent the wheel, and would also bring the whole community behind optimizing one set of operators instead of continuing the existing fragmented state of the world.

          The process of separating out such a library might also yield benefits in terms of winding up with a cleaner design and better abstractions (that's been my experience when going through similar exercises on other projects – I don't have any reason to think your current design is not clean or doesn't have good abstractions).

          Do you have any thoughts on how this could be achieved? Does this sound like something you would be interested in? Is there something that people currently working on other projects can do to help this become a reality?

          Vinod Kumar Vavilapalli added a comment -

          A huge +1 to that. Having a common set of operators will be a huge win. That said, I already see that the current branch follows Hive's operator base classes, uses HiveConf etc. I believe with little effort, this can be cleaned and pulled apart into one separate maven module that everyone can use.

          Some points to think about:

          • The target location of the module. The dependency graph can become unwieldy.
          • Given the use of the base Operator, OperatorDesc, etc. from Hive, if there is interest and commitment at all, we should do this ASAP while we only have a handful of operators.
          • Have one other project demonstrate how it can be reused across ecosystem projects. Pig would be great; just a few operators would be a good start.

          Thoughts?

          Eric Hanson added a comment -

          Dmitriy and Vinod,

          What specifically do you want to do with the code once it is factored out?

          Eric

          Dmitriy V. Ryaboy added a comment -

          I would like to provide the same vectorization benefits to Pig and similar frameworks (possibly Cascading, and maybe the Spark or Crunch guys will want to use this as well, etc).

          Jitendra Nath Pandey added a comment -

          Dmitriy, Vinod,
          There is a significant amount of vectorization work in expression evaluation, for example arithmetic expressions, logical expressions, and aggregations. Many of these expressions are pretty generic, and different systems are likely to have similar semantics for them. It should be possible to reuse this code with little change in Pig or other systems. Reusing these expressions requires the same vectorized representation of data in the processing engine, but that part of the code is also generic and reusable. I think that could be a good starting point.
          However, a good part of the vectorization work is in operator code, where we have vectorized versions of the Hive operators. These operators are closely tied to Hive semantics and implementation, so generalizing them for reuse in other projects will require some restructuring in the Hive code base as well. Also, at this point we should be thinking more generally about a common physical layer shared between Pig and Hive. These languages can continue to have different logical plans, but it would be desirable for them to share a common physical plan structure because they both use the same MapReduce runtime.

          Dmitriy V. Ryaboy added a comment -

          Jitendra,
          I believe physical plan primitives for both Hive and Pig (and potentially others) are going to come in via Tez, as both Pig and Hive want to get off strict MR in the long-term.

          I'll take a crack at extracting what's extractable. Right now Hive's UDAF reaches fairly deeply into this code, as you noted, but I think with a little restructuring this can be factored out.

          Eric Hanson added a comment -

          Updated design specification with new section describing the vectorized UDF adaptor (HIVE-4961).

          Jitendra Nath Pandey added a comment -

          The vectorization work has been committed to trunk. Going forward, all vectorization work will happen on trunk and the vectorization branch will be obsolete.

          Lars Francke added a comment -

          This is a huge patch and it's hard to see if it changes anything for the end user. As we'd like to keep the Wiki up-to-date it'd be great if someone could comment whether there are any configuration options besides hive.vectorized.execution.enabled or any other things that should be documented.

          Thanks!

          Eric Hanson added a comment -

          I've been planning to write some user documentation for this feature. Where do you think would be a good spot in the wiki to include it?

          Lefty Leverenz added a comment -

          Put it in Design Docs (https://cwiki.apache.org/confluence/display/Hive/DesignDocs) until it's released. Later you can move it into the User Docs with a note about which release introduces it. You can either change the file's location in the hierarchy or leave it in place and just link to it from the User Docs section.

          When it goes into User Docs, you have some choices. Does it belong on the Home page or in the Language Manual? If in the Language Manual, do you want it under DML or should it be a stand-alone doc? That depends on what you write and how you want readers to find the doc. You can always add links from other docs to make sure people find it.

          Here's the Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

          Of course configuration goes here, perhaps in a subsection under Query Execution: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties. I suggest you make a section in your design doc that's formatted to match the configuration doc, so when the time comes you can just cut & paste.

          Eric Hanson added a comment -

          I put initial documentation at: https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution

            People

            • Assignee: Jitendra Nath Pandey
            • Reporter: Jitendra Nath Pandey
            • Votes: 1
            • Watchers: 48

            Time Tracking

            • Original Estimate: 168h
            • Remaining Estimate: 168h
            • Time Spent: Not Specified

                  Development