Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Versions: 7.0.0, 9.0.0
- Fix Versions: None
- Components: None
- Environment: Ubuntu Linux; pyarrow 9.0.0 installed with pip (manylinux wheel); Python 3.9.0 from conda-forge; GCC 9.4.0
Description
I'm trying to read a specific large CSV file (the-reddit-climate-change-dataset-comments.csv from this dataset) in batches. This is my code:

import os
import pyarrow as pa
from pyarrow.csv import open_csv, ReadOptions
import pyarrow.parquet as pq

filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"
print(f"Reading {filename}...")
mmap = pa.memory_map(filename)
reader = open_csv(mmap)
while True:
    try:
        batch = reader.read_next_batch()
        print(len(batch))
    except StopIteration:
        break
But, after a few batches, I get an exception:
Reading /data/reddit-climate/the-reddit-climate-change-dataset-comments.csv...
1233
1279
1293
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Input In [1], in <cell line: 14>()
     13 while True:
     14     try:
---> 15         batch = reader.read_next_batch()
     16         print(len(batch))
     17     except StopIteration:

File /opt/conda/lib/python3.9/site-packages/pyarrow/ipc.pxi:683, in pyarrow.lib.RecordBatchReader.read_next_batch()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: CSV parser got out of sync with chunker
I have tried changing the block size, but I always end up with the same error sooner or later (a sketch of the exact call follows the list):
- With read_options=ReadOptions(block_size=10_000), it reads 1 batch of 11 rows and then crashes
- With block_size=100_000, it reads 103 rows and then crashes
- With block_size=1_000_000, it reads 1164 rows and then crashes
- With block_size=10_000_000, it reads 12370 rows and then crashes
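For clarity, this is a minimal sketch of how the block-size runs above were done; it mirrors the reproduction script, and only the block_size value (10_000 shown here) is changed between runs:

import pyarrow as pa
from pyarrow.csv import open_csv, ReadOptions

filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"
mmap = pa.memory_map(filename)
# Same loop as the script above, but with an explicit block size;
# the same ArrowInvalid error appears with 100_000, 1_000_000 and 10_000_000 too.
reader = open_csv(mmap, read_options=ReadOptions(block_size=10_000))
rows_read = 0
while True:
    try:
        batch = reader.read_next_batch()
        rows_read += len(batch)
        print(len(batch))
    except StopIteration:
        break
print(f"Total rows read: {rows_read}")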
I am not sure what else to try here. According to the C++ source code, this "should not happen".
I have tried with pyarrow 7.0 and 9.0; the result and traceback are identical.