  1. Apache Arrow
  2. ARROW-8074

[C++][Dataset] Support for file-like objects (buffers) in FileSystemDataset?


Details

    Description

      The current pyarrow.parquet.read_table/ParquetFile accepts buffer (reader) objects (file-like objects, pyarrow.Buffer, pyarrow.BufferReader) as input when reading a single file. This functionality is used, for example, by pandas and kartothek, and extensively in our own tests.

      While we could keep the old implementation for single files (which differs from the ParquetDataset logic), there are also advantages to handling this case in the Datasets API.
      For example, it would make the filtering functionality of the Datasets API available for single-file buffers as well, which would be a nice enhancement (currently, read_table does not support filters for single files, which is, e.g., why kartothek implements this itself).

      Would this be possible to support?

      The arrow::dataset::FileSource already has PATH and BUFFER enum types (https://github.com/apache/arrow/blob/08f8bff05af37921ff1e5a2b630ce1e7ec1c0ede/cpp/src/arrow/dataset/file_base.h#L46-L49), so it seems in principle possible to create a FileSource (for a FileSystemDataset / FileFragment) from a buffer instead of from a path?


          People

            bkietz Ben Kietzman
            jorisvandenbossche Joris Van den Bossche


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 9h 20m