  1. Apache Arrow
  2. ARROW-7628

[Python] Better document some read_csv corner cases

Details

    Description

      Hi, I have found two problematic cases, possibly bugs, in the pyarrow csv.read_csv function. I wrote the code below and ran it against the attached CSV file.

      The code compares pandas read_csv with pyarrow csv.read_csv to show that the latter does not behave as expected in the following two cases:

      1. Set skip_rows = 10. The call then fails with:

      Traceback (most recent call last):
        File "/home/athan/anaconda3/envs/TRIADB/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
          exec(code_obj, self.user_global_ns, self.user_ns)
        File "<ipython-input-21-8c5c88b190c4>", line 4, in <module>
          read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
        File "pyarrow/_csv.pyx", line 541, in pyarrow._csv.read_csv
        File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
      pyarrow.lib.ArrowKeyError: Column 'catcost' in include_columns does not exist in CSV file
      

      2. Set skip_rows = 12 and columns = None.
      In this case the error above does not occur and all columns are fetched. However, compare the two dataframes: the one from pyarrow via to_pandas() and the one returned by pandas read_csv(). The pyarrow one has not parsed the null values ('\N') in the last column, catname, whereas pandas read_csv parsed all the null values correctly.

      Out[28]: 
         1082  991   16.5    200 2014-09-10  1  bar
      0  1082  997   0.55  100.0 2014-09-10  1  bar
      1  1082  998   7.95  200.0 2014-03-03  0   \N
      2  1083  998  12.50    NaN        NaT  0  bar
      3  1083  999   1.00    NaN        NaT  0  foo
      4  1084  994  57.30  100.0 2014-12-20  1   \N
      5  1084  995  22.20    NaN        NaT  0  foo
      6  1084  998  48.60  200.0 2014-12-20  1  foo
      
      

      Python code to reproduce both cases with the attached CSV file

      from pyarrow import csv
      import pyarrow as pa
      import pandas as pd
      
      file_location = 'spc_catalog.tsv'
      
      sep = '\t'
      nulls = ['\\N']  # the file marks missing values with a literal \N
      
      # Case 1 uses this column subset; case 2 sets columns = None
      columns = ['catcost', 'catqnt', 'catdate', 'catchk', 'catname']
      column_names = None
      column_types = None
      
      # Case 1: skip_rows = 10; case 2: skip_rows = 12
      skip_rows = None
      nrecords = None
      
      # pyarrow reader
      csv.read_csv(file_location,
          parse_options=csv.ParseOptions(delimiter=sep),
          convert_options=csv.ConvertOptions(include_columns=columns, column_types=column_types, null_values=nulls),
          read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
      ).to_pandas()
      
      # pandas reader, for comparison
      pd.read_csv(file_location, sep=sep, na_values='\\N', usecols=columns, nrows=nrecords, names=column_names, dtype=column_types)
      
      

      Attachments

        1. spc_catalog.tsv
          0.6 kB
          Athanassios Hatzis

            People

              jorisvandenbossche Joris Van den Bossche
              athanassios Athanassios Hatzis

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m