Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-16937

INFORMATION_SCHEMA usability: everything is currently a string

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 3.0.0
    • 3.0.0
    • None
    • None

    Description

      HIVE-1010 adds an information schema to Hive, also taking the opportunity to expose some non-standard but valuable things like statistics in a SYS table.

      A challenge I have noted with the SYS table is that all statistic counts are exposed as string types rather than numerics.

      hive> show create table sys.tab_col_stats;
      OK
      CREATE TABLE `sys.tab_col_stats`(
        `cs_id` string COMMENT 'from deserializer',
        `db_name` string COMMENT 'from deserializer',
        `table_name` string COMMENT 'from deserializer',
        `column_name` string COMMENT 'from deserializer',
        `column_type` string COMMENT 'from deserializer',
        `tbl_id` string COMMENT 'from deserializer',
        `long_low_value` string COMMENT 'from deserializer',
        `long_high_value` string COMMENT 'from deserializer',
        `double_high_value` string COMMENT 'from deserializer',
        `double_low_value` string COMMENT 'from deserializer',
        `big_decimal_low_value` string COMMENT 'from deserializer',
        `big_decimal_high_value` string COMMENT 'from deserializer',
        `num_nulls` string COMMENT 'from deserializer',
        `num_distincts` string COMMENT 'from deserializer',
        `avg_col_len` string COMMENT 'from deserializer',
        `max_col_len` string COMMENT 'from deserializer',
        `num_trues` string COMMENT 'from deserializer',
        `num_falses` string COMMENT 'from deserializer',
        `last_analyzed` string COMMENT 'from deserializer')
      ROW FORMAT SERDE
        'org.apache.hive.storage.jdbc.JdbcSerDe'
      STORED BY
        'org.apache.hive.storage.jdbc.JdbcStorageHandler'
      

      So you might run this query to try and find the column(s) which have the most distinct values.

      select
        db_name, table_name, column_name
      from
        sys.tab_col_stats
      where
        num_distincts = ( select max(num_distincts) from sys.tab_col_stats );
      

      Unfortunately this maximum is based on string sorting so it's not likely what you really want.

      It would be better to use numeric types where appropriate such as all the numbers in tab_col_stats, and most likely bigints should be used for stats like # rows, etc.

      Attachments

        1. HIVE-16937.1.patch
          22 kB
          Gunther Hagleitner
        2. HIVE-16937.2.patch
          25 kB
          Gunther Hagleitner

        Issue Links

          Activity

            People

              hagleitn Gunther Hagleitner
              cartershanklin Carter Shanklin
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: