Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-23889

Empty bucket files are inserted with invalid schema after HIVE-21784

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • None
    • None
    • None
    • None

    Description

      HIVE-21784 uses a new WriterOptions instead of the field in OrcRecordUpdater:
      https://github.com/apache/hive/commit/f62379ba279f41b843fcd5f3d4a107b6fcd04dec#diff-bb969e858664d98848960a801fd58b5cR580-R583

      so in this scenario, the overwrite creates an empty bucket file, which is fine as that was the intention of that patch, but it creates that with invalid schema:

      CREATE TABLE test_table (
         cda_id             int,
         cda_run_id         varchar(255),
         cda_load_ts        timestamp,
         global_party_id    string)
      PARTITIONED BY (
         cda_date           int,
         cda_job_name       varchar(12))
      CLUSTERED BY (cda_id) 
      INTO 2 BUCKETS
      STORED AS ORC;
      
      
      INSERT OVERWRITE TABLE test_table PARTITION (cda_date = 20200601 , cda_job_name = 'core_base')
      SELECT 1 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts, 'global_party_id' global_party_id
      UNION ALL
      SELECT 2 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts, 'global_party_id' global_party_id;
      
      ALTER TABLE test_table ADD COLUMNS (group_id string) CASCADE ;
      
      INSERT OVERWRITE TABLE test_table PARTITION (cda_date = 20200601 , cda_job_name = 'core_base')
      SELECT 1 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts, 'global_party_id' global_party_id, 'group_id' as group_id;
      

      because of HIVE-21784, the new empty bucket_00000 shows this schema in orc dump:

      Type: struct<_col0:int,_col1:varchar(255),_col2:timestamp,_col3:string,_col4:string>
      

      instead of:

      Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<cda_id:int,cda_run_id:varchar(255),cda_load_ts:timestamp,global_party_id:string,group_id:string>>
      

      and this could lead to problems later, when hive tries to look into the file during split generation

      Attachments

        1. HIVE-23889.01.patch
          1 kB
          László Bodor

        Issue Links

          Activity

            People

              abstractdog László Bodor
              abstractdog László Bodor
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: