Description
IMPALA-5050 adds support for reading int64-encoded Parquet timestamps. These columns have the int64 physical type, so the converted/logical type has to be used to distinguish them from BIGINTs. They can be read either as BIGINT or as TIMESTAMP, depending on the table's schema.
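To illustrate the two readings, here is a minimal Python sketch (not Impala code). The helper names `read_as_bigint` and `read_as_timestamp` are hypothetical, and UTC is used only to make the example deterministic.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Ticks per second for the two units covered by IMPALA-5050.
TICKS_PER_SECOND = {"MILLIS": 1_000, "MICROS": 1_000_000}


def read_as_bigint(raw: int) -> int:
    """Column declared as BIGINT: the stored int64 is returned unchanged."""
    return raw


def read_as_timestamp(raw: int, unit: str) -> datetime:
    """Column declared as TIMESTAMP: the int64 counts ticks since the epoch."""
    micros = raw * (1_000_000 // TICKS_PER_SECOND[unit])
    return EPOCH + timedelta(microseconds=micros)


raw_millis = 1_500_000_000_123  # hypothetical stored value, in milliseconds
as_integer = read_as_bigint(raw_millis)
as_time = read_as_timestamp(raw_millis, "MILLIS")
```

The same stored value yields either a plain integer or a point in time; only the schema and the unit annotation decide which.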
CREATE TABLE LIKE PARQUET could also convert these columns to TIMESTAMP instead of BIGINT, but I decided to postpone adding this feature for two reasons:
1. It could break the following workflow:
- generate Parquet files (that contain int64 timestamps) with some tool
- use Impala's CREATE TABLE LIKE PARQUET + LOAD DATA to make it accessible as a table
- run some queries that rely on interpreting these columns as integers
Adding CAST(col AS BIGINT) to the query would make this even worse: it would convert the timestamp to Unix time in seconds instead of micros/millis, without any warning.
2. Supporting int64 timestamps with nanosecond precision will require bumping Impala's parquet-hadoop-bundle dependency to a new major version, which may contain incompatible API changes.
Note that parquet-hadoop-bundle is only used in CREATE TABLE LIKE PARQUET. The C++ parts of Impala only rely on parquet.thrift, which can be updated more easily.
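The silent unit change described in the first reason can be sketched with plain arithmetic. This is an illustrative Python sketch, not Impala code: the function names are hypothetical, and it assumes a non-negative microsecond value and that casting a timestamp to BIGINT yields whole seconds, as stated above.

```python
MICROS_PER_SECOND = 1_000_000


def value_seen_with_bigint_column(raw_micros: int) -> int:
    """Existing behavior: the query sees the raw int64 (microseconds)."""
    return raw_micros


def value_seen_after_timestamp_cast(raw_micros: int) -> int:
    """If the column became TIMESTAMP, CAST(col AS BIGINT) gives seconds."""
    return raw_micros // MICROS_PER_SECOND


raw = 1_500_000_000_123_456  # hypothetical stored value, in microseconds
before = value_seen_with_bigint_column(raw)
after = value_seen_after_timestamp_cast(raw)

# Both results are valid integers, so nothing fails; the query just
# returns numbers that are off by a factor of one million.
ratio = before // after
```

Because both values are well-formed BIGINTs, a query that compares or aggregates them would keep running and return wrong results rather than an error.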
Issue Links
- relates to IMPALA-5050 Add support to read TIMESTAMP_MILLIS and TIMESTAMP_MICROS to the parquet scanner (Resolved)