Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-21940

Metastore: Postgres text <-> clob mismatch for PARTITION_PARAMS/PARAM_VALUE

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 3.2.0
    • 4.0.0-alpha-1
    • None
    • None

    Description

      Summary: this is a fix for a regression introduced by HIVE-20833/HIVE-20221, fixed in the same way as HIVE-16667 earlier

      This issue was found while running sqoop/hive tests on a cluster with hive with postgres metastore, and it turned out the problem is that PARAM_VALUE is handled as it was CLOB but it's a text, so extractSqlClob returns it as is.

      It's reproducible on cluster by the following statements:

      USE default;
      drop table if exists my_table;
      create external table my_table (col1 int, col3 int) partitioned by (col2 string) STORED AS TEXTFILE;
      insert into my_table VALUES(11,201,"F");
      SELECT pp.* FROM sys.partition_params pp join sys.partitions p on p.part_id = pp.part_id join sys.tbls t on t.tbl_id = p.tbl_id where t.tbl_name = "my_table";
      

      sys query results in:

      +-------------+------------------------+-----------------+
      | pp.part_id  |      pp.param_key      | pp.param_value  |
      +-------------+------------------------+-----------------+
      | 151         | rawDataSize            | 28629           |
      | 151         | numRows                | 28628           |
      | 151         | transient_lastDdlTime  | 28627           |
      | 151         | COLUMN_STATS_ACCURATE  | 28626           |
      | 151         | numFiles               | 28625           |
      | 151         | totalSize              | 28622           |
      +-------------+------------------------+-----------------+
      

      Seems like (propably) since HIVE-20833/HIVE-20221 there is an inconvenience while using PARTITION_PARAMS/PARAM_VALUE, because in postgres there is no such type as CLOB, and metastore simply saves large object ids into this field.

      In case of a direct metastore query (from hive's sys schema, but the same result for direct postgres), it shows the result above (see sys query output). This is an issue when hive treats these ids as they were real values, but they are obviously not correct, and this causes various failures (e.g. using serde parameter serialization.format=28392)

      param_value values above are large object ids, according to pg_dump

      151 COLUMN_STATS_ACCURATE 28626
      SELECT pg_catalog.lo_open('28626', 131072);
      SELECT pg_catalog.lowrite(0, '\x7b2242415349435f5354415453223a2274727565222c22434f4c554d4e5f5354415453223a7b22636f6c31223a2274727565222c22636f6c33223a2274727565227d7d');
      SELECT pg_catalog.lo_close(0);
      
      

      decoded large object value:

      {"BASIC_STATS":"true","COLUMN_STATS":{"col1":"true","col3":"true"}}
      

      Attachments

        1. HIVE-21940.repro.patch
          13 kB
          László Bodor
        2. HIVE-21940.02.patch
          0.9 kB
          László Bodor
        3. HIVE-21940.01.patch
          13 kB
          László Bodor
        4. HIVE-21940.01.patch
          13 kB
          László Bodor

        Activity

          People

            abstractdog László Bodor
            abstractdog László Bodor
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: