PARQUET-1857

[C++][Parquet] ParquetFileReader unable to read files with more than 32767 row groups



    Description

I am writing a Parquet file from Rust and reading it from Python.

When writing with write_batch and a batch size of 10000, reading the Parquet file from Python gives the error below:

      ```

      >>> pd.read_parquet("some.parquet", engine="pyarrow")
      Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
      return impl.read(path, columns=columns, **kwargs)
      File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
      path, columns=columns, **kwargs
      File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1537, in read_table
      use_pandas_metadata=use_pandas_metadata)
      File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1262, in read
      use_pandas_metadata=use_pandas_metadata)
      File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 707, in read
      table = reader.read(**options)
      File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 337, in read
      use_threads=use_threads)
      File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
      File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
      OSError: Unexpected end of stream

      ```
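
For context, the writer side looks roughly like this — a minimal sketch using the Rust parquet crate's low-level writer API (a reconstruction, not my exact code; the schema is abbreviated to one column, and the crate's API has changed across versions, so details may differ). The point is that each batch is written as its own row group, so small batch sizes produce very large row group counts:

```
use std::{fs::File, sync::Arc};

use parquet::{
    data_type::Int32Type,
    file::{properties::WriterProperties, writer::SerializedFileWriter},
    schema::parser::parse_message_type,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Abbreviated schema; the real file has seven columns (a..g, see the schema below).
    let schema = Arc::new(parse_message_type("message schema { REQUIRED INT32 a; }")?);
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(File::create("some.parquet")?, schema, props)?;

    let total_rows = 450_047usize;
    let batch_size = 1usize; // 1 => 450_047 row groups, 1_000 => 451, 10_000 => 46
    let mut written = 0;
    while written < total_rows {
        let n = batch_size.min(total_rows - written);
        let values: Vec<i32> = (0..n as i32).collect(); // dummy data

        // Each next_row_group()/close() pair emits one row group in the file.
        let mut rg = writer.next_row_group()?;
        if let Some(mut col) = rg.next_column()? {
            col.typed::<Int32Type>().write_batch(&values, None, None)?;
            col.close()?;
        }
        rg.close()?;
        written += n;
    }
    writer.close()?;
    Ok(())
}
```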

Also, when using batch size 1 and then reading from Python, there is an error too:

      ```

      >>> pd.read_parquet("some.parquet", engine="pyarrow")
      Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
      return impl.read(path, columns=columns, **kwargs)
      File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
      path, columns=columns, **kwargs
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1537, in read_table
      use_pandas_metadata=use_pandas_metadata)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1262, in read
      use_pandas_metadata=use_pandas_metadata)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 707, in read
      table = reader.read(**options)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 337, in read
      use_threads=use_threads)
      File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
      File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
      OSError: The file only has 0 columns, requested metadata for column: 6

      ```

Using batch size 1000 is fine (see the row group arithmetic after the schema below).

      Note that my data has 450047 rows. Schema:

```
message schema {
  REQUIRED INT32 a;
  REQUIRED INT32 b;
  REQUIRED INT32 c;
  REQUIRED INT64 d;
  REQUIRED INT32 e;
  REQUIRED BYTE_ARRAY f (UTF8);
  REQUIRED BOOLEAN g;
}
```
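
Assuming one row group per write_batch call, the batch sizes map to the row group counts below. Only counts above 32767 (i16::MAX, per the issue title) should trip the reader, which matches batch size 1 here and batch size 1000 on the larger file later in this report. (The batch-size-10000 "Unexpected end of stream" above may be a separate problem, since 46 row groups is well under the limit.)

```
fn main() {
    let rows: u64 = 450_047;
    for batch in [1u64, 1_000, 10_000] {
        // One row group per batch => ceil(rows / batch) row groups.
        let row_groups = (rows + batch - 1) / batch;
        println!(
            "batch size {:>6} -> {:>6} row groups (over 32767: {})",
            batch,
            row_groups,
            row_groups > i16::MAX as u64,
        );
    }
}
```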

       

EDIT: as I add more rows (an estimated 80 million), using batch size 1000 stops working too:

      ```

      >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
      Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
      return impl.read(path, columns=columns, **kwargs)
      File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
      path, columns=columns, **kwargs
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1537, in read_table
      use_pandas_metadata=use_pandas_metadata)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1262, in read
      use_pandas_metadata=use_pandas_metadata)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 707, in read
      table = reader.read(**options)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 337, in read
      use_threads=use_threads)
      File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
      File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
      OSError: The file only has 0 columns, requested metadata for column: 6

      ```

Unless I am using it wrong (which does not seem likely, since the API is simple), this is not usable at all.

       

EDIT: some more logs, using batch size 1000 with a lot of rows:

      ```

      >>> df = pd.read_parquet("ping_pong.parquet", engine="pyarrow")
      Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
      return impl.read(path, columns=columns, **kwargs)
      File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
      path, columns=columns, **kwargs
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1537, in read_table
      use_pandas_metadata=use_pandas_metadata)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1262, in read
      use_pandas_metadata=use_pandas_metadata)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 707, in read
      table = reader.read(**options)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 337, in read
      use_threads=use_threads)
      File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
      File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
      OSError: The file only has -959432807 columns, requested metadata for column: 6

      ```
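
A zero, then huge negative, "column" count looks like the reader interpreting garbage metadata once the row group count no longer fits in a signed 16-bit field (the 32767 limit in the issue title is exactly i16::MAX). A toy illustration of the wraparound — my guess at the mechanism, not code from the C++ reader:

```
fn main() {
    // 450_047 row groups do not fit in an i16: the truncating cast wraps negative.
    let num_row_groups: i32 = 450_047;
    println!("{} as i16 = {}", num_row_groups, num_row_groups as i16); // 450047 as i16 = -8705
}
```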

       

      EDIT:

I wanted to try fastparquet, but it seems fastparquet does not support .set_dictionary_enabled(true), so I set it to false.
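
The only writer-side change for the fastparquet test was the dictionary setting; a sketch with the parquet crate's WriterProperties builder:

```
use std::sync::Arc;

use parquet::file::properties::WriterProperties;

fn main() {
    // Dictionary encoding off so fastparquet can read the file.
    let props = Arc::new(
        WriterProperties::builder()
            .set_dictionary_enabled(false)
            .build(),
    );
    let _ = props; // pass to SerializedFileWriter::new as in the sketch above
}
```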

It turns out fastparquet reads the file fine (see the last line of the log below), so this is likely a problem with pyarrow.

      ```

      >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
      Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
      return impl.read(path, columns=columns, **kwargs)
      File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
      path, columns=columns, **kwargs
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1281, in read_table
      use_pandas_metadata=use_pandas_metadata)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1137, in read
      use_pandas_metadata=use_pandas_metadata)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 605, in read
      table = reader.read(**options)
      File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 253, in read
      use_threads=use_threads)
      File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
      File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
      OSError: The file only has -580697109 columns, requested metadata for column: 5
      >>> df = pd.read_parquet("data/ping_pong.parquet", engine="fastparquet")

      ```

Attachments

  1. test.parquet.tgz (1.50 MB)
  2. test_2.parquet.tgz (54.01 MB)

People

  Assignee: Wes McKinney
  Reporter: Novice
