ARROW-18084

[Python] "CSV parser got out of sync with chunker" on subsequent batches regardless of block size


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.0.0, 9.0.0
    • Fix Version/s: None
    • Component/s: C++, Python
    • Labels: None
    • Environment: Ubuntu Linux
      pyarrow 9.0.0 installed with pip (manylinux wheel)
      Python 3.9.0 from conda-forge
      GCC 9.4.0

    Description

      I'm trying to read a specific large CSV file (the-reddit-climate-change-dataset-comments.csv from this dataset) in batches. This is my code:

      import pyarrow as pa
      from pyarrow.csv import open_csv, ReadOptions
      
      filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"
      
      print(f"Reading {filename}...")
      # Memory-map the file and open an incremental (streaming) CSV reader
      mmap = pa.memory_map(filename)
      reader = open_csv(mmap)
      
      # Read record batches until the reader is exhausted
      while True:
          try:
              batch = reader.read_next_batch()
              print(len(batch))
          except StopIteration:
              break
      

      But after a few batches, I get an exception:

      Reading /data/reddit-climate/the-reddit-climate-change-dataset-comments.csv...
      1233
      1279
      1293
      
      ---------------------------------------------------------------------------
      ArrowInvalid                              Traceback (most recent call last)
      Input In [1], in <cell line: 14>()
           13 while True:
           14     try:
      ---> 15         batch = reader.read_next_batch()
           16         print(len(batch))
           17     except StopIteration:
      
      File /opt/conda/lib/python3.9/site-packages/pyarrow/ipc.pxi:683, in pyarrow.lib.RecordBatchReader.read_next_batch()
      
      File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
      
      ArrowInvalid: CSV parser got out of sync with chunker
      

      I have tried changing the block size (passing read_options=ReadOptions(block_size=...) to open_csv), but I always end up with that error sooner or later (a sketch of this sweep follows the list):

      • block_size=10_000: reads 1 batch of 11 rows, then crashes
      • block_size=100_000: reads 103 rows, then crashes
      • block_size=1_000_000: reads 1164 rows, then crashes
      • block_size=10_000_000: reads 12370 rows, then crashes
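
      For reference, a minimal sketch of that block-size sweep. It reuses the same file path and the open_csv/ReadOptions calls shown above; the loop over block sizes and the ArrowInvalid handler are illustrative additions, not a verbatim copy of the original runs.

      import pyarrow as pa
      from pyarrow.csv import open_csv, ReadOptions
      
      filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"
      
      for block_size in (10_000, 100_000, 1_000_000, 10_000_000):
          # Open a fresh streaming reader with the given block size
          reader = open_csv(pa.memory_map(filename),
                            read_options=ReadOptions(block_size=block_size))
          rows = 0
          try:
              while True:
                  rows += len(reader.read_next_batch())
          except StopIteration:
              print(f"block_size={block_size}: finished cleanly after {rows} rows")
          except pa.ArrowInvalid as exc:
              # This is where "CSV parser got out of sync with chunker" surfaces
              print(f"block_size={block_size}: {rows} rows, then: {exc}")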

      I am not sure what else to try here. According to the C++ source code, this "should not happen".

      I have tried with pyarrow 7.0 and 9.0; the result and traceback are identical.



People

    Assignee: Unassigned
    Reporter: Juan Luis Cano Rodríguez (astrojuanlu)