Apache Arrow / ARROW-16436

[C++] Datasets ignores CSV autogenerate_column_names during discovery


Details

    Description

      Reproduction

      import tempfile
      from pathlib import Path
      
      import pyarrow as pa
      import pyarrow.csv as csv
      import pyarrow.dataset as ds
      
      print("PyArrow version:", pa.__version__)
      
      ro = csv.ReadOptions(autogenerate_column_names=True)
      po = csv.ParseOptions()
      co = csv.ConvertOptions()
      file_format = ds.CsvFileFormat(read_options=ro, parse_options=po, convert_options=co)
      
      with tempfile.TemporaryDirectory() as td:
          td = Path(td).resolve()
          with (td / "test.csv").open("w") as sink:
              sink.write("1,a,true,1\n")
      
          dataset = ds.dataset(str(td), format=file_format)
          print(dataset.to_table())
      

      Result:

      PyArrow version: 7.0.0
      Traceback (most recent call last):
        File "/home/lidavidm/csvdemo.py", line 20, in <module>
          dataset = ds.dataset(str(td), format=file_format)
        File "/home/lidavidm/miniconda3/envs/arrow/lib/python3.10/site-packages/pyarrow/dataset.py", line 667, in dataset
          return _filesystem_dataset(source, **kwargs)
        File "/home/lidavidm/miniconda3/envs/arrow/lib/python3.10/site-packages/pyarrow/dataset.py", line 422, in _filesystem_dataset
          return factory.finish(schema)
        File "pyarrow/_dataset.pyx", line 1680, in pyarrow._dataset.DatasetFactory.finish
        File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
        File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/tmp5rz0ipmm/test.csv': Could not open CSV input source '/tmp/tmp5rz0ipmm/test.csv': Invalid: CSV file contained multiple columns named 1. Is this a 'csv' file?
      


          People

            Assignee: Raúl Cumplido (raulcd)
            Reporter: David Li (lidavidm)