Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
3.1.2
-
None
-
None
-
Hadoop and Hive instances running via Amazon EMR.
Hadoop version 3.2.1
Hive version 3.1.2Queries run via hive CLI.
OS: Amazon Linux 2
Description
A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:
- The query SELECT s non-primitive type columns. I have tested this with ARRAY and RECORD types.
- The query involves JOIN ing of two or more tables.
- The tables are external, and they use the same storage type. This may occur with internal tables as well, but I am unable to test this in the Hive instance I have access to. The storage types I have tested this with are Parquet and Avro. If all tables use Parquet or all use Avro, the error is raised. If there is a mixture of storage types, the query runs successfully.
I tried to dig through code to find and fix the issue myself, but unfortunately I did not have a complete enough understanding of Hive and its dependencies to pinpoint the issue.
Further examples and explanation - given the following two table definitions:
CREATE EXTERNAL TABLE experimental.join_test ( int_col BIGINT, array_col ARRAY <BIGINT>, record_col STRUCT <a: BIGINT, b: BIGINT> ) STORED AS AVRO LOCATION 's3://s3-bucket/join_test/' TBLPROPERTIES ( 'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", "long"]}, {"name": "array_col", "type": ["null", {"items": ["null", "long"], "type": "array"}]}, {"name": "record_col", "type": ["null", {"fields": [{"name": "a", "type": ["null", "long"]}, {"name": "b", "type": ["null", "long"]}], "name": "record_col_0", "type": "record"}]}], "name": "Root", "type": "record"}' )
CREATE EXTERNAL TABLE experimental.join_test2 ( int_col2 BIGINT, str_col STRING ) STORED AS AVRO LOCATION 's3://s3-bucket/join_test2/' TBLPROPERTIES ( 'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", "long"]}, {"name": "str_col", "type": ["null", "string"]}], "name": "Root", "type": "record"}' )
I can successfully query both of these tables, running the following queries:
SELECT * FROM experimental.join_test; 1 [1,2] {"a":1,"b":2} 2 [3,4] {"a":3,"b":4} 3 [5,6] {"a":5,"b":6}
SELECT * FROM experimental.join_test2; 1 should-appear-after-joining 2 should-appear-after-joining 3 should-appear-after-joining 4 shouldnt-appear-after-joining
I can also join the tables together, so long as I do not select the complex type columns:
SELECT int_col, int_col2, str_col FROM experimental.join_test AS jt JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2; 3 3 should-appear-after-joining 2 2 should-appear-after-joining 1 1 should-appear-after-joining
But as soon as complex columns are introduced to the SELECT clause, an error arises.
SELECT * FROM experimental.george_test AS jt JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
I will attach the full stack trace as a file - it is rather large. The lowest-level error message given is slightly different depending on which complex type is involved in raising the error.
- If it is an ARRAY type, the error message is org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
- If it is a RECORD type, the error message is org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
If the tables are of different types - for example, if the table join_test was stored as Parquet instead of Avro - the query executes without issue.
From what I can tell by looking at the stack trace, Hive is attempting to run this query as a vectorized query, leading to the error.Some further points of support for this idea:
- Vectorization is not supported in queries that involved complex type columns, which lines up with the fact that queries only fail when complex type columns are included in the query.
- Explicitly setting hive.vectorized.execution.enabled to 'false' allows these queries to execute and finish normally.