[ARROW-11269] [Rust] Unable to read Parquet file because of mismatch in column-derived and embedded schemas - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 3.0.1, 4.0.0
Component/s: Rust
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/27170

Description

The issue seems to stem from the relatively new behavior of the Arrow Parquet reader, which uses the embedded Arrow schema instead of deriving the schema from the Parquet columns.

However, some code paths still appear to derive the type from the Parquet column types, so the Arrow record batch reader fails with an error saying that the column types must match the schema types.
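To illustrate the kind of check that produces this error, here is a minimal sketch using only the standard library. The `TimeUnit` and `DataType` enums are simplified stand-ins for the Arrow types, and `validate_columns` is a hypothetical helper written for this example; none of this is the arrow crate's actual code.

```rust
// Simplified stand-ins for the Arrow types named in this report.
// They model the real types but are NOT the arrow crate's definitions.
#[derive(Debug, Clone, PartialEq)]
enum TimeUnit {
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    // Time unit plus an optional timezone name.
    Timestamp(TimeUnit, Option<String>),
}

/// Hypothetical check: every column's type must equal the type the
/// schema declares for that position, otherwise an error is returned.
fn validate_columns(schema: &[DataType], columns: &[DataType]) -> Result<(), String> {
    if schema.len() != columns.len() {
        return Err(format!(
            "expected {} columns but found {}",
            schema.len(),
            columns.len()
        ));
    }
    for (i, (expected, found)) in schema.iter().zip(columns).enumerate() {
        // Strict equality: a timestamp with a timezone and one without
        // are different types, even if the time unit is the same.
        if expected != found {
            return Err(format!(
                "column types must match schema types, expected {:?} but found {:?} at column index {}",
                expected, found, i
            ));
        }
    }
    Ok(())
}
```

With a schema field of `Timestamp(Nanosecond, Some("UTC"))` and a column built as `Timestamp(Nanosecond, None)`, this check fails even though the underlying values are identical.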

In our case, the column type is an INT96 datetime (ns) type, and the Arrow type in the embedded schema is DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC")). However, the code that constructs the arrays seems to re-derive this column type as DataType::Timestamp(TimeUnit::Nanosecond, None), because the Parquet schema has no timezone information. As a result, Parquet files that we could read successfully with our branch of Arrow from around October are now unreadable.
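One possible way to reconcile the two types is sketched below, again with standard-library stand-ins rather than the arrow or parquet crates. The `PhysicalType` enum and the `derive_from_column` and `resolve_type` functions are hypothetical names for this example, and the approach is an assumption; it is not necessarily what the linked pull request does.

```rust
// Simplified stand-ins; NOT the arrow or parquet crate definitions.
#[derive(Debug, Clone, PartialEq)]
enum TimeUnit {
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Timestamp(TimeUnit, Option<String>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType {
    Int64,
    Int96,
}

/// Derive an Arrow-like type from the Parquet column alone.
/// INT96 carries no timezone, so the derived timestamp has none.
fn derive_from_column(physical: PhysicalType) -> DataType {
    match physical {
        PhysicalType::Int64 => DataType::Int64,
        PhysicalType::Int96 => DataType::Timestamp(TimeUnit::Nanosecond, None),
    }
}

/// Hypothetical reconciliation: when the embedded schema declares a
/// timestamp with the same unit, keep its timezone instead of dropping it.
fn resolve_type(physical: PhysicalType, embedded: Option<&DataType>) -> DataType {
    let derived = derive_from_column(physical);
    if let (DataType::Timestamp(derived_unit, None), Some(DataType::Timestamp(unit, tz))) =
        (&derived, embedded)
    {
        if derived_unit == unit {
            return DataType::Timestamp(unit.clone(), tz.clone());
        }
    }
    // No compatible embedded type: fall back to the column-derived type.
    derived
}
```

Under this sketch, the array type for the INT96 column would match the embedded schema's timestamp type, so a strict equality check between column and schema types would pass.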

I've attached an example of a Parquet file that demonstrates the problem. This file was created in Python (as most of our Parquet files are).

I've also attached a sample Rust program that will demonstrate the error.

Attachments

- 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet, 918 kB, 16/Jan/21 16:23, Max Burke
- main.rs, 0.9 kB, 16/Jan/21 16:37, Max Burke

Issue Links

links to

GitHub Pull Request #9253

People

Assignee: Neville Dipale

Reporter: Max Burke

Votes: 0

Watchers: 4

Dates

Created: 16/Jan/21 16:29

Updated: 11/Jan/23 08:18

Resolved: 20/Jan/21 04:05

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 50m