Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Versions: 7.0.0, 9.0.0
- Fix Versions: None
- Components: None
- Environment: Ubuntu Linux; pyarrow 9.0.0 installed with pip (manylinux wheel); Python 3.9.0 from conda-forge; GCC 9.4.0
Description
I'm trying to read a specific large CSV file (the-reddit-climate-change-dataset-comments.csv from this dataset) in batches. This is my code:

import os
import pyarrow as pa
from pyarrow.csv import open_csv, ReadOptions
import pyarrow.parquet as pq

filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"
print(f"Reading {filename}...")
mmap = pa.memory_map(filename)
reader = open_csv(mmap)
while True:
    try:
        batch = reader.read_next_batch()
        print(len(batch))
    except StopIteration:
        break
But, after a few batches, I get an exception:
Reading /data/reddit-climate/the-reddit-climate-change-dataset-comments.csv...
1233
1279
1293
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Input In [1], in <cell line: 14>()
     13 while True:
     14     try:
---> 15         batch = reader.read_next_batch()
     16         print(len(batch))
     17     except StopIteration:

File /opt/conda/lib/python3.9/site-packages/pyarrow/ipc.pxi:683, in pyarrow.lib.RecordBatchReader.read_next_batch()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: CSV parser got out of sync with chunker
I have tried changing the block size, but I always end up with the same error sooner or later (a sketch of the exact call follows the list):
- With read_options=ReadOptions(block_size=10_000), it reads 1 batch of 11 rows and then crashes
- With block_size=100_000, it reads 103 rows and then crashes
- With block_size=1_000_000, it reads 1164 rows and then crashes
- With block_size=10_000_000, it reads 12370 rows and then crashes
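For clarity, this is a minimal sketch of how the block-size runs above were done; it mirrors the reproduction script, and only the block_size value (10_000 shown here) is changed between runs:

import pyarrow as pa
from pyarrow.csv import open_csv, ReadOptions

filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"
mmap = pa.memory_map(filename)
# Same loop as the script above, but with an explicit block size;
# the same ArrowInvalid error appears with 100_000, 1_000_000 and 10_000_000 too.
reader = open_csv(mmap, read_options=ReadOptions(block_size=10_000))
rows_read = 0
while True:
    try:
        batch = reader.read_next_batch()
        rows_read += len(batch)
        print(len(batch))
    except StopIteration:
        break
print(f"Total rows read: {rows_read}")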
I am not sure what else to try here. According to the C++ source code, this "should not happen".
I have tried with pyarrow 7.0 and 9.0; the result and traceback are identical.