IMPALA-2835

Hive/Impala inconsistency with parquet.column.index.access=false

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.3.0
    • Fix Version/s: Impala 2.6.0
    • Component/s: Backend
    • Labels: None
    • Environment: Impala 2.3.0-cdh5.5.1 RELEASE (build 73bf5bc5afbb47aa7eab06cfbf6023ba8cb74f3c)

    Description

      In Hive, table columns can be mapped to Parquet file fields by name by setting

      parquet.column.index.access=false

      Impala cannot create a table whose columns are mapped by name, and it does not correctly query tables created by Hive with parquet.column.index.access=false. Impala always behaves as if parquet.column.index.access=true, i.e. it matches table columns to file fields by position.
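
      The difference between the two modes can be sketched in a few lines of Python. This is only an illustration, not Hive or Impala code: the function names resolve_by_position and resolve_by_name are hypothetical, and returning None for a column that has no matching field in the file is an assumption. The sketch uses a file with fields field1 and field2 and a table that declares only field2.

```python
from typing import List, Optional, Sequence


def resolve_by_position(table_cols: Sequence[str],
                        file_row: Sequence[str]) -> List[Optional[str]]:
    """Index-based access: the i-th table column reads the i-th file field.

    Column names are ignored entirely. A table column with no field at that
    position yields None (an assumption made for this sketch).
    """
    return [file_row[i] if i < len(file_row) else None
            for i in range(len(table_cols))]


def resolve_by_name(table_cols: Sequence[str],
                    file_fields: Sequence[str],
                    file_row: Sequence[str]) -> List[Optional[str]]:
    """Name-based access: each table column reads the file field with the same name.

    A table column with no same-named field yields None (an assumption made
    for this sketch).
    """
    by_name = dict(zip(file_fields, file_row))
    return [by_name.get(col) for col in table_cols]


# Parquet file written for a table with columns (field1, field2).
file_fields = ["field1", "field2"]
file_row = ["f1", "f2"]

# External table over the same file that declares only field2.
subset_cols = ["field2"]

print(resolve_by_position(subset_cols, file_row))           # ['f1']
print(resolve_by_name(subset_cols, file_fields, file_row))  # ['f2']
```

      With positional resolution, the single declared column reads the first field in the file regardless of its name. With name-based resolution, it reads the field called field2.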

      Steps to reproduce in Impala:

      $ impala-shell -i localhost -d one_off
      
      [localhost:21000] > create table parquet_table (field1 string, field2 string) stored as parquet;
      Query: create table parquet_table (field1 string, field2 string) stored as parquet
      Fetched 0 row(s) in 0.14s
      
      [localhost:21000] > insert into parquet_table values (('f1', 'f2'));
      Query: insert into parquet_table values (('f1', 'f2'))
      Inserted 1 row(s) in 4.89s
      
      [localhost:21000] > select * from parquet_table;
      Query: select * from parquet_table
      +--------+--------+
      | field1 | field2 |
      +--------+--------+
      | f1     | f2     |
      +--------+--------+
      Fetched 1 row(s) in 0.26s
      
      -- find where parquet files are in hdfs
      [localhost:21000] > show files in parquet_table;
      Query: show files in parquet_table
      +---------------------------------------------------------------------------------------------------------------------------+------+-----------+
      | path                                                                                                                      | size | partition |
      +---------------------------------------------------------------------------------------------------------------------------+------+-----------+
      | hdfs://nameservice01/user/hive/warehouse/one_off.db/parquet_table/bf4c8168cfac5dad-5abcf4063e6c53b7_253339204_data.0.parq | 382B |           |
      +---------------------------------------------------------------------------------------------------------------------------+------+-----------+
      Fetched 1 row(s) in 0.01s
      
      -- it's in /user/hive/warehouse/one_off.db/parquet_table
      
      [localhost:21000] > create external table parquet_subset (field2 string) 
      stored as parquet 
      location '/user/hive/warehouse/one_off.db/parquet_table';
      Query: create external table parquet_subset (field2 string) stored as parquet location '/user/hive/warehouse/one_off.db/parquet_table'
      
      Fetched 0 row(s) in 0.17s
      
      [localhost:21000] > select * from parquet_subset;
      Query: select * from parquet_subset
      +--------+
      | field2 |
      +--------+
      | f1     |
      +--------+
      Fetched 1 row(s) in 4.01s
      

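      The query returns f1, the value of field1, because the column is matched by position rather than by name. How can I create the parquet_subset table so that its column field2 is mapped to the field2 field in the Parquet file?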

      I also reported this issue on the forum:
      http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/external-table-stored-as-parquet-can-not-use-field-inside-a/m-p/36012

            People

              Assignee: Skye Wanderman-Milne (skye)
              Reporter: oleksii iepishkin (epishkin)
              Votes: 1
              Watchers: 6