Apache Arrow / ARROW-12025

[Python] pyarrow read_csv works incorrectly with multilines if skiprows is present


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not A Problem
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      Reproducer:
      import os

      from pyarrow.csv import read_csv, ReadOptions
      import pyarrow
      print("pyarrow.__version__:", pyarrow.__version__)

      test_filename = "test.csv"
      test_data = """col1,col2,col3,col4
      "This is a very long
      string with several
      newline characters",2,3,4
      """

      try:
          with open(test_filename, "w") as f:
              f.write(test_data)

          ans_1 = read_csv(test_filename) # works fine
          print("ans_1: \n", ans_1)
          ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
          print("ans_2: \n", ans_2)
      finally:
          os.remove(test_filename)
       
      Output:
      pyarrow.__version__: 3.0.0
      ans_1:
      pyarrow.Table
      col1: string
      col2: int64
      col3: int64
      col4: int64
      Traceback (most recent call last):
        File "pyarrow_bug.py", line 21, in <module>
          ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
        File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
        File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
        File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 4
       
      Note: Python version 3.8.8; platform: Ubuntu 20.04
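
The "Not A Problem" resolution is consistent with `skip_rows` counting physical lines rather than parsed CSV records. This reading is an assumption inferred from the error message, not something stated in the report. Under it, `skip_rows=2` drops the header line and the first line of the quoted value, so `string with several` is taken as a one-column header and the following line's four fields produce "Expected 1 columns, got 4". A minimal sketch of that difference, using only Python's standard `csv` module (the helper names are hypothetical, and this is not pyarrow's implementation):

```python
import csv
import io

# Same content as the reproducer's test file.
TEST_DATA = (
    "col1,col2,col3,col4\n"
    '"This is a very long\n'
    "string with several\n"
    'newline characters",2,3,4\n'
)


def parse_after_skipping_lines(text, n_lines):
    """Drop the first n_lines physical lines, then parse the rest as CSV."""
    remaining = text.splitlines(keepends=True)[n_lines:]
    return list(csv.reader(io.StringIO("".join(remaining))))


def parse_after_skipping_records(text, n_records):
    """Parse the whole text as CSV first, then drop the first n_records records."""
    return list(csv.reader(io.StringIO(text)))[n_records:]


# Line-based skipping cuts through the quoted multi-line value.
by_line = parse_after_skipping_lines(TEST_DATA, 2)
print([len(row) for row in by_line])  # [1, 4]

# Record-based skipping removes only the header and keeps the value whole.
by_record = parse_after_skipping_records(TEST_DATA, 1)
print(by_record)
```

Skipping one record after parsing keeps the multi-line value intact, whereas skipping two lines before parsing splits it and leaves rows with mismatched column counts.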

          People

            Assignee: Unassigned
            Reporter: Alexander M (alexander_m)