[ARROW-10480] [Python] Parquet write_table creates gzipped Parquet file, not Parquet with gzip compression

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 3.0.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/26456

Description

Writing to a path ending in ".gz" (for example "foo.parquet.gz") with pq.write_table in Arrow 2.0.0 creates a gzipped Parquet file, which Arrow can't read back. In 1.0.1 the same call created a Parquet file with gzip compression, so I think this is a regression.
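The two outcomes can be told apart from the file's magic bytes, without relying on the `file` utility. The following is a minimal sketch, not part of the original report: the helper name `classify` is hypothetical, and it assumes an unencrypted Parquet file (which starts and ends with the bytes `PAR1`) and a standard gzip stream (which starts with `0x1f 0x8b`).

```python
GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of a gzip stream
PARQUET_MAGIC = b"PAR1"    # first and last four bytes of an unencrypted Parquet file


def classify(path):
    """Return 'gzip', 'parquet', or 'unknown', based on magic bytes only.

    Hypothetical helper for illustration; it does not validate the file.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)          # move to the end to learn the file size
        size = f.tell()
        tail = b""
        if size >= 8:         # room for both a leading and a trailing magic
            f.seek(-4, 2)
            tail = f.read(4)
    if head[:2] == GZIP_MAGIC:
        return "gzip"
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    return "unknown"
```

Under these assumptions, a file produced by the 2.0.0 behavior described here would be expected to classify as 'gzip', and one produced by 1.0.1 as 'parquet'.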

In Arrow 2.0.0:

> pip freeze
numpy==1.19.4
pyarrow==2.0.0
> python write.py
Arrow: 2.0.0
Read/write with PyArrow:
test.pyarrow.gz: gzip compressed data, from Unix, original size modulo 2^32 630
Traceback (most recent call last):
  File "write.py", line 12, in <module>
    print(pq.read_table("test.pyarrow.gz"))
  File "/home/lidavidm/Code/twosigma/arrow-regression/venv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1607, in read_table
    dataset = _ParquetDatasetV2(
  File "/home/lidavidm/Code/twosigma/arrow-regression/venv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1452, in __init__
    [fragment], schema=fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 761, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Could not open parquet input source 'test.pyarrow.gz': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

But in Arrow 1.0.1:

> pip freeze
numpy==1.19.4
pyarrow==1.0.1
> python write.py
Arrow: 1.0.1
Read/write with PyArrow:
test.pyarrow.gz: Apache Parquet
pyarrow.Table
ints: int64

Reproduction:

import pyarrow as pa
import pyarrow.parquet as pq
import subprocess

print("Arrow:", pa.__version__)
print()

print("Read/write with PyArrow:")
table = pa.table([pa.array(range(4))], names=["ints"])
pq.write_table(table, "test.pyarrow.gz", compression="GZIP")
subprocess.check_call(["file", "test.pyarrow.gz"])
print(pq.read_table("test.pyarrow.gz"))
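If a file has already been written with the outer gzip layer, one way to inspect it is to strip that layer with the standard library. This is a hedged sketch rather than anything from the report or the fix: the function name `unwrap_gzip` and the destination path are hypothetical, and it assumes the payload inside the gzip wrapper is an ordinary Parquet file, as the issue title suggests.

```python
import gzip
import shutil


def unwrap_gzip(src, dst):
    """Strip the outer gzip layer from src, writing the payload to dst.

    Returns True if the payload starts with the Parquet magic b"PAR1".
    Hypothetical helper; it checks only the leading magic bytes.
    """
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)   # stream the decompressed bytes
    with open(dst, "rb") as f:
        return f.read(4) == b"PAR1"
```

For example, `unwrap_gzip("test.pyarrow.gz", "test.parquet")` returning True would suggest that "test.parquet" is worth trying with `pq.read_table`.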

Issue Links

links to GitHub Pull Request #8659

People

Assignee: David Li
Reporter: David Li
Votes: 0
Watchers: 5

Dates

Created: 03/Nov/20 13:48
Updated: 11/Jan/23 08:13
Resolved: 16/Nov/20 20:43

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 0.5h