Details
Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Impala 1.3
Description
I noticed a discrepancy with Hive, in how Impala handles column order for HBase tables.
I think it would be preferable to use the same behavior as Hive, otherwise life becomes
more complicated for anyone doing INSERT or SELECT * with an HBase table through Impala.
(And I have to add caveats and usage notes in the docs.)
Repro:
In HBase shell, create a table with a single column family. I think most Impala tests use 1 column family per column, where you won't notice this behavior.
hbase(main):008:0> create 'sample_data_fast','cols'
0 row(s) in 71.8750 seconds
In Hive shell, create a mapping table. Notice how DESCRIBE repeats back the columns in the same order as in CREATE TABLE.
hive> create external table sample_data_fast (id string, val int, zfill string, name string, assertion boolean)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
> "hbase.columns.mapping" =
> ":key,cols:val,cols:zfill,cols:name,cols:assertion")
> TBLPROPERTIES("hbase.table.name" = "sample_data_fast")
> ;
OK
Time taken: 1.7 seconds
hive> desc sample_data_fast;
OK
id string from deserializer
val int from deserializer
zfill string from deserializer
name string from deserializer
assertion boolean from deserializer
Time taken: 0.302 seconds
Now try the same DESCRIBE in impala-shell. The key column (id) is listed first. Then all the other columns, part of the same column family, are listed in alphabetical order rather than the order from CREATE TABLE:
[localhost:21000] > desc sample_data_fast;
Query: describe sample_data_fast
+-----------+---------+---------+
| name      | type    | comment |
+-----------+---------+---------+
| id        | string  |         |
| assertion | boolean |         |
| name      | string  |         |
| val       | int     |         |
| zfill     | string  |         |
+-----------+---------+---------+
Returned 5 row(s) in 0.02s
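To make the difference concrete, here is a rough Python sketch of the two orderings. It only models what the DESCRIBE output in this report shows and is not Impala's actual code. The function names are made up for illustration, and sorting by (column family, qualifier) is an assumption: in this repro the Hive column names and the HBase qualifiers are identical, so sorting by either would look the same.

```python
# Hive column names and the HBase mapping, copied from the CREATE TABLE in this report.
hive_columns = ["id", "val", "zfill", "name", "assertion"]
mapping = ":key,cols:val,cols:zfill,cols:name,cols:assertion"


def declared_order(columns):
    """Order as Hive's DESCRIBE reports it: exactly as written in CREATE TABLE."""
    return list(columns)


def observed_impala_order(columns, mapping):
    """Order seen in impala-shell here: row key first, then the rest sorted.

    Assumption: the sort key is (column family, qualifier). The report only
    says "alphabetical", so this is one plausible reading, not a confirmed rule.
    """
    pairs = list(zip(columns, mapping.split(",")))
    key_cols = [col for col, m in pairs if m == ":key"]
    rest = sorted(
        (tuple(m.split(":", 1)), col) for col, m in pairs if m != ":key"
    )
    return key_cols + [col for _, col in rest]


hive_order = declared_order(hive_columns)
impala_order = observed_impala_order(hive_columns, mapping)
print("Hive:  ", hive_order)
print("Impala:", impala_order)
```

With one column family per column, each family holds a single qualifier, so there is nothing to reorder within a family; that would be consistent with the Impala tests not showing the behavior.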
Thus if you already had Hive code that was doing SELECT * from an HBase table like this, you would get a different result set (different column order) in Impala.
If you tried to copy from an HDFS table via 'INSERT INTO hbase_table SELECT * FROM hdfs_table', you would get an error because the columns don't match. If you make a separate column family for each column, the discrepancy is masked, because the alphabetical ordering only shows up when a column family holds more than one column.
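A small Python sketch of why the positional INSERT ... SELECT * goes wrong. It assumes hdfs_table (a hypothetical table; its schema is not given in this report) has the same five columns in CREATE TABLE order, and it pairs them position by position with the column order that impala-shell reported for the HBase table.

```python
# (name, type) pairs. The source schema is an assumption: hdfs_table is taken
# to have the same columns in the order from the Hive CREATE TABLE statement.
source = [
    ("id", "string"), ("val", "int"), ("zfill", "string"),
    ("name", "string"), ("assertion", "boolean"),
]
# Target order as shown by DESCRIBE in impala-shell in this report.
target = [
    ("id", "string"), ("assertion", "boolean"), ("name", "string"),
    ("val", "int"), ("zfill", "string"),
]


def positional_mismatches(source, target):
    """Return (position, source column, target column) where the types differ."""
    return [
        (i, s, t)
        for i, (s, t) in enumerate(zip(source, target))
        if s[1] != t[1]
    ]


mismatches = positional_mismatches(source, target)
for pos, src, dst in mismatches:
    print(f"position {pos}: {src[0]} ({src[1]}) -> {dst[0]} ({dst[1]})")

# Naming the columns in the target's order avoids depending on what * expands to.
select_list = ", ".join(name for name, _ in target)
print(f"INSERT INTO hbase_table SELECT {select_list} FROM hdfs_table")
```

Note that positions 0 and 2 happen to line up by type (string to string), but position 2 would still put zfill values into the name column, so a type check alone would not catch every problem.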
Since Hive is preserving the column order, the relevant info must be there in the metastore.