Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Not A Problem
Description
Writing very long strings with pyarrow produces a file that raises an exception when read with fastparquet.

Traceback (most recent call last):
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
    read_fastparquet()
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
    dff = pf.to_pandas(['A'])
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
    index=index, assign=parts)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
    scheme=self.file_scheme)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
    cats, selfmade, assign=assign)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
    catdef=out.get(name+'-catdef', None))
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
    skip_nulls, selfmade=selfmade)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
    raw_bytes = _read_page(f, header, metadata)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
    page_header.uncompressed_page_size)
AssertionError: found 175532 raw bytes (expected 200026)
If the file is written with compression, decompression errors are reported instead:

SNAPPY: snappy.UncompressError: Error while decompressing: invalid input
GZIP: zlib.error: Error -3 while decompressing data: incorrect header check
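The GZIP failure mode can be illustrated with a small stdlib-only sketch. This is my own guess at the mechanism, not anything from the fastparquet source: if a reader begins decompressing at a miscomputed page offset, zlib rejects the bytes it sees there.

```python
import zlib

# Hypothetical illustration: a valid zlib stream decompresses fine, but
# decompressing from a misaligned offset (as when a reader miscomputes a
# page boundary) raises zlib.error -- often reported as an "incorrect
# header check" like the GZIP error above.
payload = b"A" * 40000
compressed = zlib.compress(payload)

assert zlib.decompress(compressed) == payload  # correct offset: round-trips

try:
    zlib.decompress(compressed[10:])  # wrong offset: stream header is garbage
    raised = False
except zlib.error:
    raised = True
assert raised
```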
Minimal code to reproduce:
import os
import pandas as pd
import pyarrow
import pyarrow.parquet as arrow_pq
from fastparquet import ParquetFile

# data to generate
ROW_LENGTH = 40000  # decreasing below 32750ish eliminates exception
N_ROWS = 10

# file write params
ROW_GROUP_SIZE = 5  # Lower numbers eliminate exception, but strange data is read (e.g. Nones)
FILENAME = 'test.parquet'

def write_arrow():
    df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]})
    if os.path.isfile(FILENAME):
        os.remove(FILENAME)
    arrow_table = pyarrow.Table.from_pandas(df)
    arrow_pq.write_table(arrow_table, FILENAME, use_dictionary=False,
                         compression='NONE', row_group_size=ROW_GROUP_SIZE)

def read_arrow():
    print "arrow:"
    table2 = arrow_pq.read_table(FILENAME)
    print table2.to_pandas().head()

def read_fastparquet():
    print "fastparquet:"
    pf = ParquetFile(FILENAME)
    dff = pf.to_pandas(['A'])
    print dff.head()

write_arrow()
read_arrow()
read_fastparquet()
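For what it's worth, the "expected" figure in the AssertionError is consistent with a single PLAIN-encoded BYTE_ARRAY data page for one row group (the repro disables dictionary encoding). This back-of-envelope check is my own, not from the traceback; it assumes Parquet's PLAIN encoding for BYTE_ARRAY, which stores a 4-byte little-endian length prefix before each value's bytes:

```python
# Back-of-envelope check of the 200026-byte "expected" page size,
# assuming PLAIN-encoded BYTE_ARRAY (4-byte length prefix per value).
ROW_LENGTH = 40000      # characters per string, as in the repro
ROW_GROUP_SIZE = 5      # rows per row group, as in the repro

value_bytes = ROW_GROUP_SIZE * (4 + ROW_LENGTH)
print(value_bytes)            # 200020
print(200026 - value_bytes)   # 6 bytes remaining for level data
```

The 6 leftover bytes would plausibly be the definition-level data for the optional column, so the "expected" count looks like the full uncompressed page, while the 175532 bytes "found" suggests the reader stopped short of the page boundary.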
Versions:
fastparquet==0.1.6
pyarrow==0.10.0
pandas==0.22.0
sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'
Also opened issue here: https://github.com/dask/fastparquet/issues/375