Details
-
Bug
-
Status: In Progress
-
Major
-
Resolution: Unresolved
-
Impala 3.3.0
-
None
-
Description
This is the desired use case:
1. Create an ORC table TBL1 with a DATE column.
2. Create an ORC table TBL2 with a TIMESTAMP column that has the same location as TBL1.
3. Insert some DATE values into TBL1 and some TIMESTAMP values into TBL2.
4. SELECT from TBL1 returns both the DATE and the TIMESTAMP values; the TIMESTAMP values are converted to DATE.
5. SELECT from TBL2 returns both the DATE and the TIMESTAMP values; the DATE values are converted to TIMESTAMP.
Without this feature, Impala returns an error:
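The two conversions in steps 4 and 5 can be illustrated with a minimal Python sketch. It is not Impala code: the function names are hypothetical, and it assumes that TIMESTAMP to DATE drops the time of day and that DATE to TIMESTAMP yields midnight of that day, neither of which this ticket spells out.

```python
from datetime import date, datetime, time


def timestamp_to_date(ts: datetime) -> date:
    """TIMESTAMP read through a DATE column (step 4).

    Assumption: the time-of-day part is simply dropped.
    """
    return ts.date()


def date_to_timestamp(d: date) -> datetime:
    """DATE read through a TIMESTAMP column (step 5).

    Assumption: the result is midnight (00:00:00) of the same day.
    """
    return datetime.combine(d, time.min)
```

Under these assumptions the round trip DATE to TIMESTAMP to DATE is lossless, while TIMESTAMP to DATE to TIMESTAMP loses the time of day.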
ERROR: Type mismatch: table column DATE is map to column timestamp in ORC file 'hdfs://localhost:20500/test-warehouse/orc_date_tbl/000000_0_copy_1'
Note:
With https://issues.apache.org/jira/browse/IMPALA-8801 implementing the DATE type for ORC, it is possible to read DATE values in ORC format. However, writing is still not supported and has to be done by Hive.
Let me copy-paste a code review comment from IMPALA-8801 as a suggestion for the implementation:
We can modify OrcTimestampReader to support reading orc::TimestampVectorBatch into Date type slots. In its constructor it knows which kind of slots (timestamp or date) it's writing to. So in ReadValue() it can have different behaviors based on different modes (timestamp values => timestamp slots / timestamp values => date slots). We can do the same on OrcDateColumnReader to let it support reading ORC Date values into Timestamp type slots.
Note that the life cycle of an OrcColumnReader is within the life cycle of the HdfsOrcScanner, which only reads a split of an ORC file, and an ORC file can't have two types for one column (e.g. column1 is timestamp in stripe1 and is date in stripe2). So we don't need to deal with different batch types in UpdateInputBatch().
BTW, it'd be better to add test coverage for this type compatibility check in test_scanners.py (see TestOrc.test_type_conversions).
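The reader design suggested in the review comment can be sketched in Python as follows. This is only an illustration of the idea, not the Impala C++ implementation: the classes TimestampBatchReader and DateBatchReader, the SlotKind enum, and the list-based batches are all hypothetical stand-ins. It assumes a timestamp batch carries seconds since the Unix epoch plus nanoseconds, that a date batch carries days since the epoch, and that no time zone adjustment is applied.

```python
from enum import Enum

SECONDS_PER_DAY = 86400


class SlotKind(Enum):
    TIMESTAMP = "timestamp"
    DATE = "date"


class TimestampBatchReader:
    """Hypothetical stand-in for a reader of ORC timestamp batches."""

    def __init__(self, slot_kind, seconds, nanos):
        self._seconds = seconds
        self._nanos = nanos
        # The mode is fixed once, in the constructor, because the slot type
        # and the file's column type cannot change during the reader's life.
        if slot_kind is SlotKind.DATE:
            self._read = self._read_as_date
        else:
            self._read = self._read_as_timestamp

    def read_value(self, row):
        return self._read(row)

    def _read_as_timestamp(self, row):
        # timestamp values => timestamp slots
        return (self._seconds[row], self._nanos[row])

    def _read_as_date(self, row):
        # timestamp values => date slots: floor to whole days since epoch.
        # Floor division also handles pre-1970 (negative) values.
        return self._seconds[row] // SECONDS_PER_DAY


class DateBatchReader:
    """Hypothetical stand-in for a reader of ORC date batches."""

    def __init__(self, slot_kind, days):
        self._days = days
        if slot_kind is SlotKind.TIMESTAMP:
            self._read = self._read_as_timestamp
        else:
            self._read = self._read_as_date

    def read_value(self, row):
        return self._read(row)

    def _read_as_date(self, row):
        # date values => date slots
        return self._days[row]

    def _read_as_timestamp(self, row):
        # date values => timestamp slots: midnight of that day (assumption).
        return (self._days[row] * SECONDS_PER_DAY, 0)
```

Choosing the conversion in the constructor keeps the per-row path free of type checks, which matches the observation that the batch type never changes within one scanner.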