[HIVE-11092] First delta of an ORC ACID table contains non-descriptive schema - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Hive
Labels:
- orc
- orcfile
- transaction
- transactions

Release Note:
ORC Acid delta files contain consistent schemas.

Description

I've been reading the ORC ACID data that backs transactional tables from a process external to Hive. Initially I tried to use 'schema on read', but found that the schema returned from the initial delta file is inconsistent with the schema returned from subsequent delta and base files. To reproduce the issue by example:

CREATE TABLE base_table ( id int, message string )
  PARTITIONED BY ( continent string, country string )
  CLUSTERED BY (id) INTO 1 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true');
  
INSERT INTO TABLE base_table PARTITION (continent = 'Asia', country = 'India')
VALUES (1, 'x'), (2, 'y'), (3, 'z');

UPDATE base_table SET message = 'updated' WHERE id = 1;

Now examining the raw data with the orcfiledump utility (edited for brevity):

cd hive/warehouse/base_table/continent=Asia/country=India/

hive --orcfiledump delta_0000001_0000001/bucket_00000
Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<_col0:int,_col1:string>>    
        
hive --orcfiledump delta_0000002_0000002/bucket_00000
Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<id:int,message:string>>

The row schema for the first delta, which resulted from the inserts, has its field names erased: row:struct<_col0:int,_col1:string>, whereas the delta for the update reports the correct schema: row:struct<id:int,message:string>. I have also checked this with my own reader code, so I am confident that FileDump is not at fault.
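
The difference between the two deltas can be checked mechanically. The following is a minimal sketch, not part of Hive or ORC: it parses the `Type:` strings printed by orcfiledump above and flags placeholder names of the form `_colN` in the `row` struct. All function names are hypothetical, and the parser only handles the nested `struct<...>` syntax shown in this report.

```python
import re

# Placeholder names such as _col0, _col1 that appear when field names are lost.
PLACEHOLDER = re.compile(r"^_col\d+$")

# Type strings copied from the orcfiledump output in this report.
FIRST_DELTA = (
    "struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,"
    "currentTransaction:bigint,row:struct<_col0:int,_col1:string>>"
)
SECOND_DELTA = (
    "struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,"
    "currentTransaction:bigint,row:struct<id:int,message:string>>"
)


def split_top_level(body):
    """Split a field list on commas that are not nested inside <...>."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(body[start:i])
            start = i + 1
    if body[start:]:
        parts.append(body[start:])
    return parts


def struct_fields(type_string):
    """Return [(name, type)] for a 'struct<...>' type string."""
    type_string = type_string.strip()
    if not (type_string.startswith("struct<") and type_string.endswith(">")):
        raise ValueError("not a struct type: " + type_string)
    body = type_string[len("struct<"):-1]
    fields = []
    for part in split_top_level(body):
        name, _, field_type = part.partition(":")
        fields.append((name, field_type))
    return fields


def row_field_names(acid_type_string):
    """Names of the fields inside the 'row' struct of an ACID schema."""
    top_level = dict(struct_fields(acid_type_string))
    return [name for name, _ in struct_fields(top_level["row"])]


def has_placeholder_names(acid_type_string):
    """True if any field of the 'row' struct is named like _colN."""
    return any(PLACEHOLDER.match(n) for n in row_field_names(acid_type_string))


if __name__ == "__main__":
    for label, schema in (("delta 1", FIRST_DELTA), ("delta 2", SECOND_DELTA)):
        print(label, row_field_names(schema), has_placeholder_names(schema))
```

Run against the two schemas above, the check reports placeholder names for the first delta only.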

I believe that the row field names, and hence schema, should be consistent across all ORC files in the ACID data set. This will enable schema on read with field access by name (not index), which is currently not possible. Therefore I'd like to get this issue resolved.
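
To illustrate what access by index forces on an external reader today, here is a rough sketch of a positional fallback. It assumes the reader can obtain the declared column names (`id`, `message`) from somewhere outside the file, for example the metastore, which is exactly the dependency that schema on read is meant to avoid. The functions `resolve_field_names` and `row_as_dict` are hypothetical helpers, not Hive or ORC APIs.

```python
import re

# Matches erased field names and captures their positional index.
PLACEHOLDER = re.compile(r"^_col(\d+)$")


def resolve_field_names(file_field_names, table_columns):
    """Map _colN placeholders back to declared column names by position.

    file_field_names: names found in the 'row' struct of an ORC file.
    table_columns: declared column names, assumed to come from an
    external source such as the table definition.
    """
    resolved = []
    for position, name in enumerate(file_field_names):
        match = PLACEHOLDER.match(name)
        if match is None:
            # The file already carries a real name; keep it.
            resolved.append(name)
            continue
        index = int(match.group(1))
        if index != position or index >= len(table_columns):
            raise ValueError("cannot resolve field name: " + name)
        resolved.append(table_columns[index])
    return resolved


def row_as_dict(values, file_field_names, table_columns):
    """Pair row values with resolved names so fields can be read by name."""
    names = resolve_field_names(file_field_names, table_columns)
    return dict(zip(names, values))


if __name__ == "__main__":
    columns = ["id", "message"]
    # First delta: names erased, so they are resolved by position.
    print(row_as_dict((1, "x"), ["_col0", "_col1"], columns))
    # Second delta: names present, so they are used directly.
    print(row_as_dict((1, "updated"), ["id", "message"], columns))
```

With consistent field names in every file, the external column list and the positional mapping would be unnecessary.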

I'm happy to work on this. However, after working through OrcRecordUpdater, FileSinkOperator, and the related tests, I've failed to reproduce or isolate the issue at a smaller scale. I'd be grateful for some suggestions on where to look next.

Issue Links

is related to

HIVE-4243 Fix column names in FileSinkOperator (Closed)

relates to

HIVE-15190 Field names are not preserved in ORC files written with ACID (Closed)

People

Assignee: Elliot West

Reporter: Elliot West

Votes: 0

Watchers: 4

Dates

Created: 24/Jun/15 11:33

Updated: 13/Nov/16 22:21