Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.13.0
Description
I have the following csv file (note that col_a contains a negative zero value):

col_a,col_b
0.0,0.0
-0.0,0.0
...and process it via:
from pyarrow import csv, parquet

in_csv = 'in.csv'
table = csv.read_csv(in_csv)
parquet.write_to_dataset(table, root_path='./')
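For reference, the statistics that Presto checks can be inspected with pyarrow itself. This is a minimal sketch, assuming write_to_dataset produced a single UUID-named file directly under root_path and a pyarrow version that exposes Statistics.min/max:

import glob
import pyarrow.parquet as pq

# write_to_dataset names its output file with a UUID, so glob for it.
path = glob.glob('./*.parquet')[0]
stats = pq.ParquetFile(path).metadata.row_group(0).column(0).statistics
# col_a is the first column; a healthy file reports min <= max here.
print(stats.min, stats.max)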
The output parquet file is then uploaded to S3 and queried via AWS Athena (i.e. PrestoDB / Hive).
Any query that touches col_a fails with the following error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split {{REDACTED}} (offset=0, length=593): low must be less than or equal to high
As a sanity check, I converted the same csv file to parquet with an AWS Glue Spark job, and that output could be queried successfully.
As such, it appears that the pyarrow writer produces an invalid parquet file when a column contains at least one instance of 0.0, at least one instance of -0.0, and no other values. Given the error message, the likely culprit is the column-chunk statistics: the writer seems to record a min/max pair for col_a that the reader considers out of order (Presto runs on the JVM, where Double.compare orders -0.0 strictly below 0.0, so a stored min of 0.0 against a max of -0.0 would fail the low <= high check).
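Until this is fixed in the writer, one possible workaround is to normalize negative zeros before writing. A minimal sketch against a recent pyarrow API, assuming the two-column schema above; in IEEE 754 arithmetic, -0.0 + 0.0 yields +0.0, so adding zero flips only the sign of negative zero:

import pyarrow as pa
from pyarrow import csv, parquet

table = csv.read_csv('in.csv')

# Replace any -0.0 in col_a with +0.0; every other value (including
# NaN) passes through unchanged.
idx = table.schema.get_field_index('col_a')
values = table.column(idx).to_pandas().to_numpy() + 0.0
table = table.set_column(idx, 'col_a', pa.array(values))

parquet.write_to_dataset(table, root_path='./')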