[HIVE-23889] Empty bucket files are inserted with invalid schema after HIVE-21784 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

HIVE-21784 uses a new WriterOptions instance instead of the existing field in OrcRecordUpdater:
https://github.com/apache/hive/commit/f62379ba279f41b843fcd5f3d4a107b6fcd04dec#diff-bb969e858664d98848960a801fd58b5cR580-R583

As a result, in the scenario below the overwrite creates an empty bucket file. That is fine in itself, since it was the intention of that patch, but the file is created with an invalid schema:

CREATE TABLE test_table (
   cda_id             int,
   cda_run_id         varchar(255),
   cda_load_ts        timestamp,
   global_party_id    string)
PARTITIONED BY (
   cda_date           int,
   cda_job_name       varchar(12))
CLUSTERED BY (cda_id) 
INTO 2 BUCKETS
STORED AS ORC;


INSERT OVERWRITE TABLE test_table PARTITION (cda_date = 20200601 , cda_job_name = 'core_base')
SELECT 1 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts, 'global_party_id' global_party_id
UNION ALL
SELECT 2 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts, 'global_party_id' global_party_id;

ALTER TABLE test_table ADD COLUMNS (group_id string) CASCADE ;

INSERT OVERWRITE TABLE test_table PARTITION (cda_date = 20200601 , cda_job_name = 'core_base')
SELECT 1 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts, 'global_party_id' global_party_id, 'group_id' as group_id;

Because of HIVE-21784, the new empty bucket_00000 shows this schema in the ORC file dump:

Type: struct<_col0:int,_col1:varchar(255),_col2:timestamp,_col3:string,_col4:string>

instead of:

Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<cda_id:int,cda_run_id:varchar(255),cda_load_ts:timestamp,global_party_id:string,group_id:string>>

This could lead to problems later, when Hive tries to read the file during split generation.
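The difference between the two dumps can be checked mechanically. The following is a minimal sketch, not part of Hive or ORC: it assumes the `Type:` line has already been copied out of the ORC dump as a string, and the helper names `top_level_fields` and `has_acid_schema` are hypothetical. It only compares the top-level field names against the ACID wrapper fields listed in the expected schema above.

```python
# Top-level fields of the ACID row wrapper, copied from the expected schema
# in this report (assumption: this list is what a valid ACID file exposes).
ACID_FIELDS = [
    "operation",
    "originalTransaction",
    "bucket",
    "rowId",
    "currentTransaction",
    "row",
]


def top_level_fields(type_line):
    """Return the top-level field names of an ORC struct type string."""
    s = type_line.strip()
    if s.startswith("Type:"):
        s = s[len("Type:"):].strip()
    if not (s.startswith("struct<") and s.endswith(">")):
        raise ValueError("expected a struct<...> type string")
    body = s[len("struct<"):-1]

    fields, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch in "<(":
            depth += 1          # entering a nested type or a (precision) list
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            fields.append(body[start:i])   # comma at depth 0 ends a field
            start = i + 1
    fields.append(body[start:])
    return [f.split(":", 1)[0].strip() for f in fields]


def has_acid_schema(type_line):
    """True if the top-level fields match the ACID wrapper fields exactly."""
    return top_level_fields(type_line) == ACID_FIELDS


if __name__ == "__main__":
    bad = "Type: struct<_col0:int,_col1:varchar(255),_col2:timestamp,_col3:string,_col4:string>"
    print(top_level_fields(bad), has_acid_schema(bad))
```

Run against the two `Type:` lines quoted in this report, the check rejects the `_col0`..`_col4` schema of the empty bucket file and accepts the wrapped schema.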

Attachments

HIVE-23889.01.patch
21/Jul/21 11:58
1 kB
László Bodor

Issue Links

fixes

HIVE-23758 OrcInputFormat.getSargColumnNames might be more failsafe in case of schema mismatch

Resolved

is caused by

HIVE-21784 Insert overwrite on an acid (not mm) table is ineffective if the input is empty

Closed

Is contained by

HIVE-26751 Bug Fixes and Improvements for 3.2.0 release

Open

is part of

HIVE-22538 RS deduplication does not always enforce hive.optimize.reducededuplication.min.reducer

Closed

People

Assignee: László Bodor

Reporter: László Bodor

Votes: 0

Watchers: 3

Dates

Created: 21/Jul/20 09:02

Updated: 17/Nov/22 13:14

Resolved: 21/Jul/20 09:21