Details
-
Bug
-
Status: Closed
-
Blocker
-
Resolution: Invalid
-
0.15.1
-
None
-
None
-
Linux, Ubuntu 18.04, arrow/parquet 0.15.1 from instructions https://arrow.apache.org/install/
Description
When I set TZ=CST-8 (or any other timezone) on Linux, the time columns in my resulting Parquet file are corrupted.
Below are the calls I use to define my schema:
PrimitiveNode::Make( columnName, Repetition::REQUIRED, LogicalType::Timestamp( true, LogicalType::TimeUnit::MICROS, false, false ), ::parquet::Type::INT64 ) );
PrimitiveNode::Make( columnName, repetition, LogicalType::Time( true, LogicalType::TimeUnit::MICROS ), ::parquet::Type::INT64 ) );
I use an Int64Writer for both types. When reading the file (here using pandas with pyarrow, but the same happens in C++), I get the following exception:
File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
It seems as if the column header is defining a timestamp with a timezone, even though I manually set is_adjusted_to_utc.
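For reference, the INT64 values passed to the Int64Writer for these two logical types are plain integer counts. Per the Parquet format specification, a Timestamp(MICROS) value with is_adjusted_to_utc set to true is the number of microseconds since the Unix epoch in UTC, and a Time(MICROS) value is the number of microseconds since midnight. The sketch below computes both with pure integer arithmetic, so the TZ environment variable is never consulted. The helper names (DaysFromCivil, TimeOfDayMicros, UtcTimestampMicros) are hypothetical and not part of the Arrow/Parquet API. It is offered only as a way to check whether the TZ setting influences the values before they reach the writer, not as a diagnosis of the exception above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; not part of the Arrow/Parquet API.

// Days between 1970-01-01 and the given proleptic Gregorian date.
// Uses only integer arithmetic (no mktime/localtime), so TZ is never read.
inline int64_t DaysFromCivil(int64_t year, unsigned month, unsigned day) {
  assert(month >= 1 && month <= 12 && day >= 1 && day <= 31);
  year -= (month <= 2) ? 1 : 0;  // count Jan/Feb as the end of the previous year
  const int64_t era = (year >= 0 ? year : year - 399) / 400;
  const unsigned year_of_era = static_cast<unsigned>(year - era * 400);
  const unsigned shifted_month = month > 2 ? month - 3 : month + 9;  // March = 0
  const unsigned day_of_year = (153 * shifted_month + 2) / 5 + day - 1;
  const unsigned day_of_era =
      year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
  return era * 146097 + static_cast<int64_t>(day_of_era) - 719468;
}

// Microseconds since midnight: the INT64 payload of a Time(MICROS) column.
inline int64_t TimeOfDayMicros(int hour, int minute, int second, int micros) {
  return ((static_cast<int64_t>(hour) * 60 + minute) * 60 + second) * 1000000LL +
         micros;
}

// Microseconds since the Unix epoch for a wall-clock time already in UTC:
// the INT64 payload of a Timestamp(MICROS) column with is_adjusted_to_utc=true.
inline int64_t UtcTimestampMicros(int64_t year, unsigned month, unsigned day,
                                  int hour, int minute, int second, int micros) {
  return DaysFromCivil(year, month, day) * 86400LL * 1000000LL +
         TimeOfDayMicros(hour, minute, second, micros);
}
```

For example, UtcTimestampMicros(2019, 11, 1, 0, 0, 0, 0) yields 1572566400000000 whatever TZ is set to. Comparing such values against what the application actually passes to WriteBatch under TZ=CST-8 would show whether the timezone setting alters the data on the writer side.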