Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 2.0.0
Description
Writing "foo.parquet.gz" in Arrow 2.0.0 produces a gzip archive wrapping the entire Parquet file, which Arrow can't read back, whereas in 1.0.1 the same call produced a plain Parquet file with gzip column compression. Hence I think this is a regression.
In Arrow 2.0.0:
> pip freeze
numpy==1.19.4
pyarrow==2.0.0
> python write.py
Arrow: 2.0.0

Read/write with PyArrow:
test.pyarrow.gz: gzip compressed data, from Unix, original size modulo 2^32 630
Traceback (most recent call last):
  File "write.py", line 12, in <module>
    print(pq.read_table("test.pyarrow.gz"))
  File "/home/lidavidm/Code/twosigma/arrow-regression/venv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1607, in read_table
    dataset = _ParquetDatasetV2(
  File "/home/lidavidm/Code/twosigma/arrow-regression/venv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1452, in __init__
    [fragment], schema=fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 761, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Could not open parquet input source 'test.pyarrow.gz': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
But in Arrow 1.0.1:
> pip freeze
numpy==1.19.4
pyarrow==1.0.1
> python write.py
Arrow: 1.0.1

Read/write with PyArrow:
test.pyarrow.gz: Apache Parquet
pyarrow.Table
ints: int64
Reproduction:
import pyarrow as pa
import pyarrow.parquet as pq
import subprocess

print("Arrow:", pa.__version__)
print()
print("Read/write with PyArrow:")
table = pa.table([pa.array(range(4))], names=["ints"])
pq.write_table(table, "test.pyarrow.gz", compression="GZIP")
subprocess.check_call(["file", "test.pyarrow.gz"])
print(pq.read_table("test.pyarrow.gz"))
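A quick way to tell which kind of file a given version produced, without relying on the `file` command, is to inspect the magic bytes: a whole-file gzip archive starts with `1f 8b` (RFC 1952), while a valid Parquet file both starts and ends with the 4-byte marker `PAR1`. This is a stdlib-only diagnostic sketch (the `classify` helper is illustrative, not part of any library):

```python
GZIP_MAGIC = b"\x1f\x8b"
PARQUET_MAGIC = b"PAR1"

def classify(path):
    """Classify a file as 'gzip', 'parquet', or 'unknown' by its magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # seek to the last 4 bytes
        tail = f.read(4)
    if head.startswith(GZIP_MAGIC):
        # Whole-file gzip wrapper, as produced by the 2.0.0 behavior above
        return "gzip"
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        # Real Parquet file (column compression is internal and not visible here)
        return "parquet"
    return "unknown"
```

Running `classify("test.pyarrow.gz")` should report "gzip" for the file written by 2.0.0 and "parquet" for the file written by 1.0.1, matching the `file` output shown above.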