Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.13.0
Description
I have the following csv file (note that col_a contains a negative zero value):

col_a,col_b
0.0,0.0
-0.0,0.0
...and process it via:
from pyarrow import csv, parquet

in_csv = 'in.csv'
table = csv.read_csv(in_csv)
parquet.write_to_dataset(table, root_path='./')
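For reference, the statistics that Presto checks can be inspected with pyarrow itself. This is a minimal sketch, assuming write_to_dataset produced a single UUID-named file directly under root_path and a pyarrow version that exposes Statistics.min/max:

import glob
import pyarrow.parquet as pq

# write_to_dataset names its output file with a UUID, so glob for it.
path = glob.glob('./*.parquet')[0]
stats = pq.ParquetFile(path).metadata.row_group(0).column(0).statistics
# col_a is the first column; a healthy file reports min <= max here.
print(stats.min, stats.max)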
The output parquet file is then uploaded to S3 and queried via AWS Athena (i.e. PrestoDB / Hive).
Any query that touches col_a fails with the following error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split {{REDACTED}} (offset=0, length=593): low must be less than or equal to high
As a sanity check, I converted the same csv file to parquet with an AWS Glue Spark job, and that output could be queried successfully.
As such, it appears that the pyarrow writer produces an invalid parquet file when a column contains at least one instance of 0.0, at least one instance of -0.0, and no other values. Given the error message, the likely culprit is the column-chunk statistics: the writer seems to record a min/max pair for col_a that the reader considers out of order (Presto runs on the JVM, where Double.compare orders -0.0 strictly below 0.0, so a stored min of 0.0 against a max of -0.0 would fail the low <= high check).
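Until this is fixed in the writer, one possible workaround is to normalize negative zeros before writing. A minimal sketch against a recent pyarrow API, assuming the two-column schema above; in IEEE 754 arithmetic, -0.0 + 0.0 yields +0.0, so adding zero flips only the sign of negative zero:

import pyarrow as pa
from pyarrow import csv, parquet

table = csv.read_csv('in.csv')

# Replace any -0.0 in col_a with +0.0; every other value (including
# NaN) passes through unchanged.
idx = table.schema.get_field_index('col_a')
values = table.column(idx).to_pandas().to_numpy() + 0.0
table = table.set_column(idx, 'col_a', pa.array(values))

parquet.write_to_dataset(table, root_path='./')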