[HIVE-18553] Support schema evolution in Parquet Vectorization reader - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.2, 2.4.0, 3.0.0
Fix Version/s: 3.0.0
Component/s: None
Labels: None

Description

Schema evolution covers the following cases:
1. Column changes
   - column reorder
   - column add, column delete
   - column rename
2. Type conversion
   - low precision to high precision
   - type to String
For the first case, the current code does not support column addition. The detailed error is as follows:

0: jdbc:hive2://localhost:10000/default> desc test_p;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| t1        | tinyint    |          |
| t2        | tinyint    |          |
| i1        | int        |          |
| i2        | int        |          |
+-----------+------------+----------+
0: jdbc:hive2://localhost:10000/default> set hive.fetch.task.conversion=none;
0: jdbc:hive2://localhost:10000/default> set hive.vectorized.execution.enabled=true;
0: jdbc:hive2://localhost:10000/default> alter table test_p add columns (ts timestamp);
0: jdbc:hive2://localhost:10000/default> select * from test_p;
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

The following exception is seen in the logs:

Caused by: java.lang.IllegalArgumentException: [ts] BINARY is not in the store: [[i1] INT32, [i2] INT32, [t1] INT32, [t2] INT32] 3
        at org.apache.parquet.hadoop.ColumnChunkPageReadStore.getPageReader(ColumnChunkPageReadStore.java:160) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.buildVectorizedParquetReader(VectorizedParquetRecordReader.java:479) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:432) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:393) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:345) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:88) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:167) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:52) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:229) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:142) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271) ~[hadoop-mapreduce-client-common-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_121]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_121]
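
The exception shows the reader asking the Parquet page store for `ts`, a column that exists in the table schema but not in files written before the `alter table`. One common way to tolerate this is to check the file's columns first and substitute a null-producing reader for any column the file lacks. The sketch below illustrates that idea only. Every name in it (`ColumnReader`, `FileColumnReader`, `NullColumnReader`, `readerFor`) is hypothetical and is not a Hive or Parquet API, and it is not necessarily how the actual patch works.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: none of these types are Hive or Parquet classes.
class MissingColumnSketch {

    /** Reads one batch of values for a single column. */
    interface ColumnReader {
        // Fills every slot of `out`; null marks an absent value.
        void readBatch(Object[] out);
    }

    /** Stands in for a reader backed by real column data in the file. */
    static final class FileColumnReader implements ColumnReader {
        private final Object value;

        FileColumnReader(Object value) {
            this.value = value;
        }

        @Override
        public void readBatch(Object[] out) {
            Arrays.fill(out, value);
        }
    }

    /** Used when the table schema lists a column that the file does not contain. */
    static final class NullColumnReader implements ColumnReader {
        @Override
        public void readBatch(Object[] out) {
            Arrays.fill(out, null);
        }
    }

    /** Chooses a reader per requested column instead of failing on absent ones. */
    static ColumnReader readerFor(String column, Set<String> fileColumns) {
        return fileColumns.contains(column)
            ? new FileColumnReader(0)   // placeholder value for the demo
            : new NullColumnReader();
    }

    public static void main(String[] args) {
        // Columns physically present in the old file, as in the error message.
        Set<String> fileColumns =
            new LinkedHashSet<>(Arrays.asList("t1", "t2", "i1", "i2"));

        Object[] batch = new Object[4];
        readerFor("ts", fileColumns).readBatch(batch);   // added after the file was written
        System.out.println(Arrays.toString(batch));
    }
}
```

With this shape, rows from old files surface the new column as NULL, which matches what a user would normally expect after adding a column.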

For the second case, the non-vectorized Parquet reader uses the existing Parquet String inspector to perform the conversion, while the vectorized path does not.
To close this gap, this JIRA provides an abstraction layer that reads the underlying data and converts it to the type Hive requires for further computation.
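
As a rough illustration of such an abstraction layer, the sketch below separates the type the consumer asks for from the type physically stored in the file. The names `ValueReader` and `Int32BackedReader` are assumptions made for this example and are not taken from the patch. It covers only a column stored as 32-bit integers being read as a wider numeric type or as a string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: these types are illustrative, not Hive classes.
class TypeConversionSketch {

    /** What the consumer asks for, regardless of how the file stores the value. */
    interface ValueReader {
        long readLong();
        double readDouble();
        byte[] readString();
    }

    /** Reader over a column physically stored as 32-bit integers. */
    static final class Int32BackedReader implements ValueReader {
        private final int[] values;
        private int pos = 0;

        Int32BackedReader(int[] values) {
            this.values = values;
        }

        // Low precision to high precision: int widens to long without loss.
        @Override
        public long readLong() {
            return values[pos++];
        }

        // int widens to double without loss as well.
        @Override
        public double readDouble() {
            return values[pos++];
        }

        // Type to String: render the stored value as UTF-8 text.
        @Override
        public byte[] readString() {
            return Integer.toString(values[pos++]).getBytes(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        ValueReader reader = new Int32BackedReader(new int[] {7, -3, 42});
        System.out.println(reader.readLong());
        System.out.println(reader.readDouble());
        System.out.println(new String(reader.readString(), StandardCharsets.UTF_8));
    }
}
```

The vectorized reader would then call the method matching the Hive column type, and each storage-specific implementation decides how to convert.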