Details
- Bug
- Status: Resolved
- Minor
- Resolution: Fixed
- 0.15.1
- Ubuntu bionic
Description
Hi, I have found two problematic cases, possibly bugs, in pyarrow's csv.read_csv. I wrote the code below and ran it on the attached CSV file.
The code compares pandas read_csv with pyarrow csv.read_csv and shows that the latter misbehaves when the parameters are changed as follows:
1. Change the parameter skip_rows = 10. This raises:
Traceback (most recent call last):
  File "/home/athan/anaconda3/envs/TRIADB/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-21-8c5c88b190c4>", line 4, in <module>
    read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
  File "pyarrow/_csv.pyx", line 541, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowKeyError: Column 'catcost' in include_columns does not exist in CSV file
2. Change the parameters skip_rows = 12, columns = None.
In this case the error above does not occur and all columns are fetched. But compare the two dataframes: the one from pyarrow via to_pandas() and the one returned by pandas read_csv(). The first has not parsed the null values ('\N') in the last column, catname, whereas pandas read_csv parses all of them correctly.
Out[28]:
   1082  991   16.5    200  2014-09-10  1  bar
0  1082  997   0.55  100.0  2014-09-10  1  bar
1  1082  998   7.95  200.0  2014-03-03  0   \N
2  1083  998  12.50    NaN         NaT  0  bar
3  1083  999   1.00    NaN         NaT  0  foo
4  1084  994  57.30  100.0  2014-12-20  1   \N
5  1084  995  22.20    NaN         NaT  0  foo
6  1084  998  48.60  200.0  2014-12-20  1  foo
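A note on case 1: the output above has data values (1082, 991, ...) where the column names should be, which suggests that skip_rows counts lines from the top of the file, header line included. If that is so, the line read as the header after skipping no longer contains 'catcost', which would be consistent with the ArrowKeyError. The sketch below only mimics that behaviour with the standard library; it does not call pyarrow, and the sample data and the read_header helper are made up for illustration.

```python
import csv

# Hypothetical two-column sample in the same tab-separated layout.
SAMPLE = "catcost\tcatname\n16.5\tbar\n7.95\tfoo\n12.5\tbar\n"


def read_header(text, skip_rows):
    """Drop the first `skip_rows` physical lines, then treat the next line as the header."""
    lines = text.splitlines()[skip_rows:]
    reader = csv.reader(lines, delimiter="\t")
    return next(reader)


header_no_skip = read_header(SAMPLE, 0)
header_skip_one = read_header(SAMPLE, 1)

print(header_no_skip)                # the real column names
print(header_skip_one)               # a data row promoted to header
print("catcost" in header_skip_one)  # the name requested in include_columns is gone
```

Under this reading, selecting columns by name can only work if the header line is not among the skipped rows, or if the names are supplied explicitly through column_names.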
Python code to test the attached CSV file for the bugs reported above
from pyarrow import csv
import pyarrow as pa
import pandas as pd

file_location = 'spc_catalog.tsv'
sep = '\t'
nulls = ['\\N']
columns = ['catcost', 'catqnt', 'catdate', 'catchk', 'catname']
column_names = None
column_types = None
skip_rows = None
nrecords = None

csv.read_csv(file_location,
             parse_options=csv.ParseOptions(delimiter=sep),
             convert_options=csv.ConvertOptions(include_columns=columns,
                                                column_types=column_types,
                                                null_values=nulls),
             read_options=csv.ReadOptions(skip_rows=skip_rows,
                                          autogenerate_column_names=False,
                                          use_threads=True,
                                          column_names=column_names)
             ).to_pandas()

pd.read_csv(file_location, sep=sep, na_values='\\N', usecols=columns,
            nrows=nrecords, names=column_names, dtype=column_types)