  1. Apache Arrow
  2. ARROW-12631

[Python] pyarrow.dataset.write_dataset should accept a Scanner to write

Details

    Description

      Assume you open a dataset and want to write it back with some projected columns. Currently you need to either materialize it to a Table or convert it to an iterator of batches before you can write the dataset:

      import pyarrow as pa
      import pyarrow.dataset as ds
      
      dataset = ds.dataset(pa.table({'a': [1, 2, 3]}))
      
      # write with projected columns
      projection = {'b': ds.field('a')}
      
      # this works but materializes full table
      ds.write_dataset(dataset.to_table(columns=projection), "test.parquet", format="parquet")
      
      # this requires the exact schema, which is a bit annoying as you need to construct that manually
      ds.write_dataset(dataset.to_batches(columns=projection), "test.parquet", format="parquet", schema=...<projected schema>...)
      

      You would expect to be able to do the following instead:

      ds.write_dataset(dataset.scanner(columns=projection), "test.parquet", format="parquet")
      

      cc lidavidm do you think this logic is correct?

      (encountered this while trying to reproduce ARROW-12620 in Python)

            People

              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m