Apache Arrow / ARROW-16436

[C++] Datasets ignores CSV autogenerate_column_names during discovery


Details

    Description

      Reproduction

      import tempfile
      from pathlib import Path
      
      import pyarrow as pa
      import pyarrow.csv as csv
      import pyarrow.dataset as ds
      
      print("PyArrow version:", pa.__version__)
      
      ro = csv.ReadOptions(autogenerate_column_names=True)
      po = csv.ParseOptions()
      co = csv.ConvertOptions()
      file_format = ds.CsvFileFormat(read_options=ro, parse_options=po, convert_options=co)
      
      with tempfile.TemporaryDirectory() as td:
          td = Path(td).resolve()
          with (td / "test.csv").open("w") as sink:
              sink.write("1,a,true,1\n")
      
          dataset = ds.dataset(str(td), format=file_format)
          print(dataset.to_table())
      

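      Result (the error message suggests discovery treated the only data row as a header row, so the repeated value 1 was rejected as a duplicate column name):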

      PyArrow version: 7.0.0
      Traceback (most recent call last):
        File "/home/lidavidm/csvdemo.py", line 20, in <module>
          dataset = ds.dataset(str(td), format=file_format)
        File "/home/lidavidm/miniconda3/envs/arrow/lib/python3.10/site-packages/pyarrow/dataset.py", line 667, in dataset
          return _filesystem_dataset(source, **kwargs)
        File "/home/lidavidm/miniconda3/envs/arrow/lib/python3.10/site-packages/pyarrow/dataset.py", line 422, in _filesystem_dataset
          return factory.finish(schema)
        File "pyarrow/_dataset.pyx", line 1680, in pyarrow._dataset.DatasetFactory.finish
        File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
        File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/tmp5rz0ipmm/test.csv': Could not open CSV input source '/tmp/tmp5rz0ipmm/test.csv': Invalid: CSV file contained multiple columns named 1. Is this a 'csv' file?
      

            People

              raulcd Raúl Cumplido
              lidavidm David Li

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m