  1. Apache Arrow
  2. ARROW-12482

[Doc][Python] Mention CSVStreamingReader pitfalls with type inference

Details

    Description

      It looks like Arrow infers the column types from the first batch and applies them to all subsequent batches. However, the first batch may not contain enough information to infer the correct types for the whole file. In our particular case, Arrow infers one field in the schema as date32 from the first batch, but a later batch has an empty value in that field, which cannot be converted to date32.

      When I increase the block size so that such a value falls in the first batch, Arrow sets the string type for that field (I am not sure why it is not a nullable date32), since the value cannot be converted to date32, and the whole file is read successfully.

      This problem can be easily reproduced by using the following code and attached dataset:

      import pyarrow as pa
      import pyarrow.csv as pa_csv  # public module instead of the private pyarrow._csv
      import pyarrow.fs as pa_fs    # public module instead of the private pyarrow._fs

      # A 5 MB block: the problematic empty value is not in the first block.
      read_options: pa_csv.ReadOptions = pa_csv.ReadOptions(block_size=5_000_000)
      parse_options: pa_csv.ParseOptions = pa_csv.ParseOptions(newlines_in_values=True)
      convert_options: pa_csv.ConvertOptions = pa_csv.ConvertOptions(timestamp_parsers=[''])
      with pa_fs.LocalFileSystem().open_input_file("dataset.csv") as file:
          reader = pa_csv.open_csv(
              file,
              read_options=read_options,
              parse_options=parse_options,
              convert_options=convert_options,
          )
          # The error is raised here, while reading a later batch.
          for batch in reader:
              table_batch = pa.Table.from_batches([batch])
              print(table_batch)
      

      Error message:

       for batch in reader:
       File "pyarrow/ipc.pxi", line 497, in __iter__
       File "pyarrow/ipc.pxi", line 531, in pyarrow.lib.RecordBatchReader.read_next_batch
       File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
       pyarrow.lib.ArrowInvalid: In CSV column #23: CSV conversion error to date32[day]: invalid value ''
      

       
      With a block_size of `10_000_000`, the file is read successfully, because the problematic value then falls in the first batch.

      An error occurs when I try to attach the dataset, so you can download it from Google Drive here

            People

              apitrou Antoine Pitrou
              oshevchenko Oleksandr Shevchenko
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m