Apache Arrow / ARROW-9864

[Python] pathlib.Path not supported in write_to_dataset with partition columns

Details

    Description

      Copying over from https://github.com/pandas-dev/pandas/issues/35902

      import pathlib

      import pandas as pd

      df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': 'C'})
      
      df.to_parquet('tmp_path1.parquet')  # OK
      df.to_parquet(pathlib.Path('tmp_path2.parquet'))  # OK
      
      df.to_parquet('tmp_path3.parquet', partition_cols=['B'])  # OK
      df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B'])  # TypeError
      

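      `to_parquet` raises a TypeError when the path is given as a pathlib.Path and `partition_cols` is not None. Without partition columns, a pathlib.Path is accepted. As the traceback shows, the error comes from pyarrow's `write_to_dataset`, which builds each partition directory with '/'.join([root_path, subdir]); str.join only accepts str items, so a Path as `root_path` fails.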

      ---------------------------------------------------------------------------
      TypeError                                 Traceback (most recent call last)
      <ipython-input-53-cae5a944d982> in <module>
            3 
            4 df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK
      ----> 5 df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B'])  # TypeError
      ...
      
      ~/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py in write_to_dataset(table, root_path, partition_cols, partition_filename_cb, filesystem, **kwargs)
         1790             subtable = pa.Table.from_pandas(subgroup, schema=subschema,
         1791                                             safe=False)
      -> 1792             _mkdir_if_not_exists(fs, '/'.join([root_path, subdir]))
         1793             if partition_filename_cb:
         1794                 outfile = partition_filename_cb(keys)
      
      TypeError: sequence item 0: expected str instance, PosixPath found
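
The failing line in the traceback can be reproduced with the standard library alone. The sketch below is a minimal illustration, not the fix applied in pyarrow: the partition directory name `B=C` is an assumed example value, and converting the path with `os.fspath` is one possible way to make the join work.

```python
import os
import pathlib

root_path = pathlib.Path("tmp_path4.parquet")
subdir = "B=C"  # assumed example of a partition subdirectory name

# str.join requires every item to be a str, so a Path object is rejected.
try:
    "/".join([root_path, subdir])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# os.fspath returns the str form of a path-like object (and leaves str as-is),
# after which the join succeeds.
joined = "/".join([os.fspath(root_path), subdir])
```

On the caller's side, passing `str(path)` instead of the `pathlib.Path` object to `df.to_parquet` avoids the error in the same way, since the string path with `partition_cols` is reported above as working.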
      

            People

              jorisvandenbossche Joris Van den Bossche
              jorisvandenbossche Joris Van den Bossche
                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 20m