Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-24647

Some Non-Vectorizable Queries are Run as Vectorized - Leads to Query Failures

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.1.2
    • None
    • None
    • Hadoop and Hive instances running via Amazon EMR.
      Hadoop version 3.2.1
      Hive version 3.1.2

      Queries run via hive CLI.

      OS: Amazon Linux 2

    Description

      A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:

      1. The query SELECT s non-primitive type columns. I have tested this with ARRAY and RECORD types.
      2. The query involves JOIN ing of two or more tables.
      3. The tables are external, and they use the same storage type. This may occur with internal tables as well, but I am unable to test this in the Hive instance I have access to. The storage types I have tested this with are Parquet and Avro. If all tables use Parquet or all use Avro, the error is raised. If there is a mixture of storage types, the query runs successfully.

      I tried to dig through code to find and fix the issue myself, but unfortunately I did not have a complete enough understanding of Hive and its dependencies to pinpoint the issue.

      Further examples and explanation - given the following two table definitions:

      CREATE EXTERNAL TABLE experimental.join_test (
           int_col BIGINT,
           array_col ARRAY <BIGINT>,
           record_col STRUCT <a: BIGINT, b: BIGINT>
      )
      STORED AS AVRO
      LOCATION 's3://s3-bucket/join_test/'
      TBLPROPERTIES (
        'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", "long"]},
                  {"name": "array_col",
                   "type": ["null", {"items": ["null", "long"], "type": "array"}]},
                  {"name": "record_col",
                   "type": ["null",
                            {"fields": [{"name": "a", "type": ["null", "long"]},
                                        {"name": "b", "type": ["null", "long"]}],
                             "name": "record_col_0",
                             "type": "record"}]}],
       "name": "Root",
       "type": "record"}'
      )
      
      CREATE EXTERNAL TABLE experimental.join_test2 (
           int_col2 BIGINT,
           str_col STRING
      )
      STORED AS AVRO
      LOCATION 's3://s3-bucket/join_test2/'
      TBLPROPERTIES (
        'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", "long"]},
                  {"name": "str_col", "type": ["null", "string"]}],
       "name": "Root",
       "type": "record"}'
      )
      

      I can successfully query both of these tables, running the following queries:

      SELECT * FROM experimental.join_test;
      
      1	[1,2]	{"a":1,"b":2}
      2	[3,4]	{"a":3,"b":4}
      3	[5,6]	{"a":5,"b":6}
      
      SELECT * FROM experimental.join_test2;
      
      1	should-appear-after-joining
      2	should-appear-after-joining
      3	should-appear-after-joining
      4	shouldnt-appear-after-joining
      

      I can also join the tables together, so long as I do not select the complex type columns:

      SELECT int_col, int_col2, str_col
      FROM experimental.join_test AS jt
      JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;
      
      3	3	should-appear-after-joining
      2	2	should-appear-after-joining
      1	1	should-appear-after-joining
      

      But as soon as complex columns are introduced to the SELECT clause, an error arises.

      SELECT *
      FROM experimental.george_test AS jt
      JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
      

      I will attach the full stack trace as a file - it is rather large. The lowest-level error message given is slightly different depending on which complex type is involved in raising the error.

      • If it is an ARRAY type, the error message is org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
      • If it is a RECORD type, the error message is org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector

      If the tables are of different types - for example, if the table join_test was stored as Parquet instead of Avro - the query executes without issue.

      From what I can tell by looking at the stack trace, Hive is attempting to run this query as a vectorized query, leading to the error.Some further points of support for this idea:

      1. Vectorization is not supported in queries that involved complex type columns, which lines up with the fact that queries only fail when complex type columns are included in the query.
      2. Explicitly setting hive.vectorized.execution.enabled to 'false' allows these queries to execute and finish normally.

      Attachments

        1. hive_complex_join_err.txt
          28 kB
          George Wood

        Activity

          People

            Unassigned Unassigned
            Geoiv George Wood
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: