  1. IMPALA
  2. IMPALA-8077

Avoid converting timestamps in dropped rows during Parquet scanning

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Backend

      Description

      If the flag convert_legacy_hive_parquet_utc_timestamps is true, every TIMESTAMP value is converted from UTC to local time during Parquet scanning. The conversion happens during column decoding, and Impala materializes every column before evaluating the WHERE predicate. As a result, if a timestamp column is not referenced by the predicate, the conversion is also performed, unnecessarily, for rows that fail the predicate.

      Example:
      CREATE TABLE t (id INT, ts TIMESTAMP) STORED AS PARQUET;
      SELECT * FROM t WHERE id = 1;

      Timezone conversion will be done for every 'ts' value, even if the predicate matches only a single row (let's ignore statistics and dictionary filtering). The CPU time of the query above is likely to be dominated by timezone conversion, especially if the query is very selective.

      Note that the same overhead is "normal" if the predicate uses the timestamp column, e.g. in
      SELECT * FROM t WHERE ts = "2019-01-14 16:00:00";
      It would be possible to avoid the conversion in this case as well, but doing so would be very hacky, so it is out of scope for this issue.
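
      The cost difference for the first example query (WHERE id = 1) can be illustrated with a small sketch. This is not Impala code: it is a simplified Python model, with hypothetical names (CountingConverter, scan_eager, scan_deferred) and an assumed fixed UTC+1 local time zone, that only counts how many conversions each strategy performs.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+1 offset stands in for the real local time zone.
LOCAL_TZ = timezone(timedelta(hours=1))


class CountingConverter:
    """Converts naive UTC timestamps to naive local time and counts calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, ts_utc):
        self.calls += 1
        aware = ts_utc.replace(tzinfo=timezone.utc)
        return aware.astimezone(LOCAL_TZ).replace(tzinfo=None)


def scan_eager(ids, ts_utc, wanted_id, convert):
    # Models the behavior described above: every column is materialized
    # (and every timestamp converted) before the predicate is evaluated.
    rows = [(i, convert(t)) for i, t in zip(ids, ts_utc)]
    return [row for row in rows if row[0] == wanted_id]


def scan_deferred(ids, ts_utc, wanted_id, convert):
    # Models the proposed improvement: evaluate the predicate first and
    # convert timestamps only in rows that survive it.
    return [(i, convert(t)) for i, t in zip(ids, ts_utc) if i == wanted_id]


# A toy "table" of 1000 rows where only one row has id = 1.
ids = list(range(1000))
base = datetime(2019, 1, 14, 16, 0, 0)
ts_utc = [base + timedelta(seconds=i) for i in ids]

eager = CountingConverter()
deferred = CountingConverter()
rows_eager = scan_eager(ids, ts_utc, 1, eager)
rows_deferred = scan_deferred(ids, ts_utc, 1, deferred)

print(eager.calls, deferred.calls)  # 1000 conversions vs. 1
print(rows_eager == rows_deferred)  # both strategies return the same row
```

      Both strategies return the same result, but the deferred variant performs one conversion instead of one per row, which is the saving this issue is asking for in the case where the timestamp column is not part of the predicate.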

              People

              • Assignee:
                Unassigned
                Reporter:
                csringhofer Csaba Ringhofer
              • Votes:
                0
                Watchers:
                3