Spark / SPARK-38140

Desc column stats (min, max) for timestamp type is not consistent with the value due to time zone difference

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1.2, 3.2.1
    • Fix Version/s: 3.3.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, a timestamp column's stats (min/max) are stored in UTC in the metastore, and when the column is described with DESC, its min/max stats are also shown in UTC.

      As a result, for users whose session time zone is not UTC, the displayed column stats are not consistent with the actual values, which causes confusion.

      For example:

      spark-sql> create table tab_ts_master (ts timestamp) using parquet;
      
      spark-sql> insert into tab_ts_master values make_timestamp(2022, 1, 1, 0, 0, 1.123456), make_timestamp(2022, 1, 3, 0, 0, 2.987654);
      
      spark-sql> select * from tab_ts_master;
      2022-01-01 00:00:01.123456
      2022-01-03 00:00:02.987654
      
      spark-sql> set spark.sql.session.timeZone;
      spark.sql.session.timeZone	Asia/Shanghai
      
      spark-sql> analyze table tab_ts_master compute statistics for all columns;
      
      spark-sql> desc formatted tab_ts_master ts;
      col_name	ts
      data_type	timestamp
      comment	NULL
      min	2021-12-31 16:00:01.123456
      max	2022-01-02 16:00:02.987654
      num_nulls	0
      distinct_count	2
      avg_col_len	8
      max_col_len	8
      histogram	NULL
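
The 8-hour gap between the selected values and the reported min/max can be checked by hand. The following is a minimal sketch in plain Python, not Spark code and not the fix itself: it treats the displayed stats as UTC and renders them in the session time zone. The helper name `utc_stat_to_session` is hypothetical, and Asia/Shanghai is assumed to be a fixed UTC+8 offset for these dates.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Asia/Shanghai is modeled as a fixed UTC+8 offset (no DST in 2022).
SHANGHAI = timezone(timedelta(hours=8), "Asia/Shanghai")

FMT = "%Y-%m-%d %H:%M:%S.%f"


def utc_stat_to_session(stat: str, session_tz: timezone) -> str:
    """Hypothetical helper: read a min/max string as UTC, render it in session_tz."""
    utc_value = datetime.strptime(stat, FMT).replace(tzinfo=timezone.utc)
    return utc_value.astimezone(session_tz).strftime(FMT)


# min and max exactly as printed by `desc formatted tab_ts_master ts`
print(utc_stat_to_session("2021-12-31 16:00:01.123456", SHANGHAI))
print(utc_stat_to_session("2022-01-02 16:00:02.987654", SHANGHAI))
```

Under these assumptions the two printed values match the rows returned by `select * from tab_ts_master`, which is what a user in Asia/Shanghai would expect the stats to show.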
      

          People

            Assignee: Zhenhua Wang (zhenhuawang)
            Reporter: Zhenhua Wang (zhenhuawang)