[HIVE-16291] Hive fails when a query unions a Parquet table with itself

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0
Component/s: Hive
Labels: None

Description

Commands to reproduce:

create table tst_unin (col1 int) partitioned by (p_tdate int) stored as parquet;
insert into tst_unin partition (p_tdate=201603) values (20160312), (20160310);
insert into tst_unin partition (p_tdate=201604) values (20160412), (20160410);
select count(*) from (select tst_unin.p_tdate from tst_unin where tst_unin.col1=20160302 union all select tst_unin.p_tdate from tst_unin) t1;

The table is stored in Parquet, a columnar file format. Hive pushes the query's column projections and predicates down to the table scan operators so that only the needed columns are read. It does this by adding the needed column IDs to the job configuration under the property "hive.io.file.readcolumn.ids".
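As a rough illustration of that mechanism, the sketch below stores a list of column IDs as a comma-separated string under the property name and reads it back. It is not Hive's code: the plain Map standing in for the job configuration and the helper names setReadColumnIds and getReadColumnIds are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: a plain Map stands in for the Hadoop job configuration.
final class ReadColumnIdsSketch {
    static final String READ_COLUMN_IDS = "hive.io.file.readcolumn.ids";

    // Hypothetical helper: store the needed column IDs as a comma-separated string.
    static void setReadColumnIds(Map<String, String> conf, List<Integer> ids) {
        String value = ids.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
        conf.put(READ_COLUMN_IDS, value);
    }

    // Hypothetical helper: read the IDs back; an absent or empty value means "no columns".
    static List<Integer> getReadColumnIds(Map<String, String> conf) {
        String value = conf.getOrDefault(READ_COLUMN_IDS, "");
        List<Integer> ids = new ArrayList<>();
        if (value.isEmpty()) {
            return ids;
        }
        for (String token : value.split(",")) {
            ids.add(Integer.parseInt(token));
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setReadColumnIds(conf, List.of(0, 2));
        System.out.println(conf.get(READ_COLUMN_IDS));
        System.out.println(getReadColumnIds(conf));
    }
}
```

A reader that only receives the IDs 0 and 2 can skip every other column in the file, which is the benefit a columnar format such as Parquet is meant to provide.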

In the case above, the query unions the results of two subqueries that select from the same table. The first subquery filters on "col1" and so must read that column from the Parquet file, while the second selects only the partition column "p_tdate" and needs no column from the file. Hive has a bug here: it ends up setting "hive.io.file.readcolumn.ids" to a value like "0,,0", which the method ColumnProjectionUtils.getReadColumnIDs cannot parse.
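The sketch below shows why a value like "0,,0" is a problem for a strict parser, and one way a merge step could avoid producing it. It rests on assumptions: the class EmptyColumnIdDemo and its methods are hypothetical, the way the empty entry arises (joining per-scan ID strings where one scan needs no columns) is a guess for illustration, and none of this is the code from the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Hive.
final class EmptyColumnIdDemo {
    // Naive merge: joins per-scan ID strings with commas, even when one is empty.
    static String naiveMerge(List<String> perScanIds) {
        return String.join(",", perScanIds);
    }

    // Tolerant merge: skips scans that contribute no column IDs.
    static String tolerantMerge(List<String> perScanIds) {
        List<String> nonEmpty = new ArrayList<>();
        for (String ids : perScanIds) {
            if (!ids.isEmpty()) {
                nonEmpty.add(ids);
            }
        }
        return String.join(",", nonEmpty);
    }

    // Strict parser: every comma-separated token must be an integer.
    static List<Integer> parse(String value) {
        List<Integer> ids = new ArrayList<>();
        if (value.isEmpty()) {
            return ids;
        }
        for (String token : value.split(",")) {
            ids.add(Integer.parseInt(token)); // NumberFormatException on an empty token
        }
        return ids;
    }

    public static void main(String[] args) {
        // One entry per scan; the middle scan needs no file columns.
        List<String> scans = List.of("0", "", "0");
        System.out.println(naiveMerge(scans));           // 0,,0
        System.out.println(parse(tolerantMerge(scans))); // [0, 0]
        try {
            parse(naiveMerge(scans));
        } catch (NumberFormatException e) {
            System.out.println("cannot parse: " + naiveMerge(scans));
        }
    }
}
```

Skipping empty contributions when building the string keeps the parser strict; the alternative is to make the parser ignore empty tokens, which also tolerates values already written in the broken form.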

Attachments

HIVE-16291.1.patch (2 kB, 24/Mar/17 13:57, Yibing Shi)
HIVE-16291.2.patch (2 kB, 06/Apr/17 05:21, Yibing Shi)

People

Assignee: Yibing Shi

Reporter: Yibing Shi

Votes: 0

Watchers: 4

Dates

Created: 24/Mar/17 13:56

Updated: 22/May/18 23:59

Resolved: 07/Apr/17 14:07