Apache Arrow / ARROW-10480

[Python] Parquet write_table creates gzipped Parquet file, not Parquet with gzip compression

Details

    Description

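      In Arrow 2.0.0, writing a table to a path such as "foo.parquet.gz" with pq.write_table produces a gzipped Parquet file (a Parquet file wrapped in an outer gzip stream), which pq.read_table cannot read back. In 1.0.1, the same call produced a regular Parquet file with gzip compression applied inside the file. I believe this is a regression.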

      In Arrow 2.0.0:

      > pip freeze
      numpy==1.19.4
      pyarrow==2.0.0
      > python write.py
      Arrow: 2.0.0
      Read/write with PyArrow:
      test.pyarrow.gz: gzip compressed data, from Unix, original size modulo 2^32 630
      Traceback (most recent call last):
        File "write.py", line 12, in <module>
          print(pq.read_table("test.pyarrow.gz"))
        File "/home/lidavidm/Code/twosigma/arrow-regression/venv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1607, in read_table
          dataset = _ParquetDatasetV2(
        File "/home/lidavidm/Code/twosigma/arrow-regression/venv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1452, in __init__
          [fragment], schema=fragment.physical_schema,
        File "pyarrow/_dataset.pyx", line 761, in pyarrow._dataset.Fragment.physical_schema.__get__
        File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
        File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
      OSError: Could not open parquet input source 'test.pyarrow.gz': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file. 

      But in Arrow 1.0.1:

      > pip freeze
      numpy==1.19.4
      pyarrow==1.0.1
      > python write.py
      Arrow: 1.0.1
      Read/write with PyArrow:
      test.pyarrow.gz: Apache Parquet
      pyarrow.Table
      ints: int64 
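
      The two `file` results above differ only in the on-disk container format. A rough way to make the same check without pyarrow or the `file` tool is sketched below. It assumes only the standard magic numbers: a gzip stream starts with the bytes 0x1f 0x8b, and an unencrypted Parquet file starts and ends with "PAR1". The helper name sniff_container is hypothetical and is not part of pyarrow.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip stream
PARQUET_MAGIC = b"PAR1"    # first and last four bytes of an unencrypted Parquet file


def sniff_container(path):
    """Return 'gzip', 'parquet', or 'unknown' from the file's magic bytes only.

    This is a heuristic: it does not validate the Parquet footer or the
    gzip stream, it only looks at the leading and trailing bytes.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        tail = b""
        if size >= 4:
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
    if head[:2] == GZIP_MAGIC:
        return "gzip"
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    return "unknown"
```

      Given the outputs above, sniff_container("test.pyarrow.gz") would be expected to return "gzip" for the file written by 2.0.0 and "parquet" for the one written by 1.0.1.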

      Reproduction:

      import pyarrow as pa
      import pyarrow.parquet as pq
      import subprocess
      
      print("Arrow:", pa.__version__)
      print()
      
      print("Read/write with PyArrow:")
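      table = pa.table([pa.array(range(4))], names=["ints"])  # one int64 column: 0..3
      pq.write_table(table, "test.pyarrow.gz", compression="GZIP")  # expected: Parquet file with gzip compression
      subprocess.check_call(["file", "test.pyarrow.gz"])  # report the on-disk file type (needs the Unix `file` tool)
      print(pq.read_table("test.pyarrow.gz"))  # raises OSError on 2.0.0, succeeds on 1.0.1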
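
      If the file written by 2.0.0 is, as the `file` output suggests, a complete Parquet file inside an outer gzip stream, it may be recoverable by stripping that outer layer before reading. The following is a minimal sketch under that assumption, using only the Python standard library. The function name strip_outer_gzip and the output file name are hypothetical.

```python
import gzip
import shutil


def strip_outer_gzip(src, dst):
    """Decompress the outer gzip layer of `src` into `dst`.

    Data is streamed in chunks, so the whole file is never held in memory.
    """
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

      For example, after strip_outer_gzip("test.pyarrow.gz", "test.parquet"), calling pq.read_table("test.parquet") should succeed if the assumption above holds. This has not been verified against the file from this report.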
      

       

            People

              lidavidm David Li

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h