Apache Arrow / ARROW-15270

[Python] Make dataset.dataset() accept a list of directories as source

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Information Provided
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      Currently, if I partition a dataset as shown below, a directory named "partitioned" is created with 2001, 2002, 2003, and 2004 as subdirectories. However, if I want to read only the partitions for the years 2001 and 2002, there is no straightforward way to do so by passing those directories to dataset.dataset().

      >>> import pyarrow as pa
      >>> import pyarrow.dataset as ds
      >>> from pyarrow import fs
      >>> table = pa.table({'month': [1, 2, 3, 4, 5], 'year': [2001, 2002, 2003, 2004, 2004]})
      >>> table
      pyarrow.Table
      month: int64
      year: int64
      ----
      month: [[1,2,3,4,5]]
      year: [[2001,2002,2003,2004,2004]]
      >>> ds.write_dataset(data=table, base_dir="partitioned", format="ipc", partitioning=ds.partitioning(pa.schema([("year", pa.int64())])))
      >>> f = fs.SubTreeFileSystem(base_path='partitioned', base_fs=fs.LocalFileSystem())
      >>> ds.dataset(source=['2001','2002'], filesystem=f, format="ipc")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 683, in dataset
          return _filesystem_dataset(source, **kwargs)
        File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 423, in _filesystem_dataset
          fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
        File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 344, in _ensure_multiple_sources
          raise IsADirectoryError(
      IsADirectoryError: Path 2001 points to a directory, but only file paths are supported. To construct a nested or union dataset pass a list of dataset objects instead.

      Since dataset.write_dataset() produces this file structure, maybe dataset.dataset() should accept a list of directories as source?

       

          People

            Assignee: Unassigned
            Reporter: Reynaldo Rojas Zelaya (userr2232)