IMPALA-886

Always display HBase cols in same order as CREATE TABLE statement

Details

    Description

      I noticed that Impala handles column order for HBase tables differently than Hive does.
      I think it would be preferable to match Hive's behavior; otherwise life becomes
      more complicated for anyone doing INSERT or SELECT * with an HBase table through Impala.
      (And I have to add caveats and usage notes in the docs.)

      Repro:

      In the HBase shell, create a table with a single column family. I think most Impala tests use one column family per column, in which case you won't notice this behavior.

      hbase(main):008:0> create 'sample_data_fast','cols'
      0 row(s) in 71.8750 seconds

      In Hive shell, create a mapping table. Notice how DESCRIBE repeats back the columns in the same order as in CREATE TABLE.

      hive> create external table sample_data_fast (id string, val int, zfill string, name string, assertion boolean)
      > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      > WITH SERDEPROPERTIES (
      > "hbase.columns.mapping" =
      > ":key,cols:val,cols:zfill,cols:name,cols:assertion")
      > TBLPROPERTIES("hbase.table.name" = "sample_data_fast")
      > ;
      OK
      Time taken: 1.7 seconds
      hive> desc sample_data_fast;
      OK
      id string from deserializer
      val int from deserializer
      zfill string from deserializer
      name string from deserializer
      assertion boolean from deserializer
      Time taken: 0.302 seconds

      Now try the same DESCRIBE in impala-shell. The key column (id) is listed first. Then all the other columns, part of the same column family, are listed in alphabetical order rather than the order from CREATE TABLE:

      [localhost:21000] > desc sample_data_fast;
      Query: describe sample_data_fast
      -------------------------
      name       type     comment
      -------------------------
      id         string
      assertion  boolean
      name       string
      val        int
      zfill      string
      -------------------------
      Returned 5 row(s) in 0.02s
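
To make the two orderings concrete, here is a minimal sketch that reproduces both DESCRIBE results from the column list and mapping string used in the repro. The rule in `impala_order` (row key first, then sort by column family and qualifier) is only an assumption inferred from the output above, not a statement about Impala's actual code; both function names are made up for illustration.

```python
# Column names and hbase.columns.mapping string from the CREATE TABLE above.
COLUMNS = ["id", "val", "zfill", "name", "assertion"]
MAPPING = ":key,cols:val,cols:zfill,cols:name,cols:assertion"


def hive_order(columns):
    """Declaration order, which is what Hive's DESCRIBE reports."""
    return list(columns)


def impala_order(columns, mapping):
    """Assumed model of the observed behavior: the row key column first,
    then the remaining columns sorted by (family, qualifier)."""
    pairs = list(zip(columns, mapping.split(",")))
    key_cols = [name for name, spec in pairs if spec == ":key"]
    others = sorted(
        (tuple(spec.split(":", 1)), name)
        for name, spec in pairs
        if spec != ":key"
    )
    return key_cols + [name for _, name in others]


if __name__ == "__main__":
    print("Hive:  ", hive_order(COLUMNS))
    print("Impala:", impala_order(COLUMNS, MAPPING))
```

With one column family per column, the sort key would be dominated by the family name, so whether the two orders agree would depend on how the families are named.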

      Thus if you already had Hive code that was doing SELECT * from an HBase table like this, you would get a different result set (different column order) in Impala.
      If you tried to copy from an HDFS table via 'INSERT INTO hbase_table SELECT * FROM hdfs_table', you would get an error because the columns don't match.
      If you made a separate column family for each column, the discrepancy would be masked, because the alphabetical ordering only shows up when a column family contains more than one column.
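
The column mismatch in the INSERT ... SELECT * case can be sketched by pairing columns by position. This assumes the HDFS table was declared with the same columns and types as the Hive CREATE TABLE in the repro (the ticket does not show its definition), and the helper name `positional_mismatches` is hypothetical.

```python
# (name, type) pairs. DECLARED is the CREATE TABLE order, assumed here to
# also be the order of the HDFS source table. IMPALA_ORDER is the order
# Impala's DESCRIBE reported for the HBase table.
DECLARED = [
    ("id", "string"), ("val", "int"), ("zfill", "string"),
    ("name", "string"), ("assertion", "boolean"),
]
IMPALA_ORDER = [
    ("id", "string"), ("assertion", "boolean"), ("name", "string"),
    ("val", "int"), ("zfill", "string"),
]


def positional_mismatches(source, target):
    """Return (position, source_col, target_col) wherever the types differ
    when SELECT * columns are matched to target columns by position."""
    return [
        (pos, src, dst)
        for pos, (src, dst) in enumerate(zip(source, target))
        if src[1] != dst[1]
    ]


if __name__ == "__main__":
    for pos, src, dst in positional_mismatches(DECLARED, IMPALA_ORDER):
        print(f"position {pos}: {src[0]} {src[1]} -> {dst[0]} {dst[1]}")
```

Under these assumptions, three positions have differing types. The position where zfill lines up with name has matching types (both string), so a type check alone would not flag that pairing.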

      Since Hive preserves the column order, the relevant information must be available in the metastore.

          People

            csringhofer Csaba Ringhofer
            jrussell John Russell