ARROW-10882: [Python][Dataset] Writing dataset from python iterator of record batches

Details

    Description

      At the moment, you can write a dataset from Python with ds.write_dataset, for example starting from a list of record batches.

      However, the input currently has to be an actual list (or it gets converted to one), so an iterator or generator is fully consumed, potentially bringing all record batches into memory, before writing starts.

      We should also be able to use the Python iterator itself to back a RecordBatchIterator-like object that can be consumed while writing the batches.

      We already have an arrow::py::PyRecordBatchReader that might be useful here.

            People

              lidavidm David Li
              jorisvandenbossche Joris Van den Bossche
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 40m