Apache Arrow / ARROW-17228

[Python] dataset.write_dataset should use Scanner.projected_schema when passed a scanner with projected columns


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 8.0.0
    • Fix Version/s: 10.0.0
    • Component/s: Python
    • Environment: Python 3.9.13, pyarrow 8.0.0

    Description

      In the code below:

      import pyarrow as pa
      import pyarrow.dataset as ds

      # Source table: note the lowercase 'region' column.
      table = pa.Table.from_arrays(
          [
              pa.array(['a', 'b', 'c'], pa.string()),
              pa.array(['a', 'b', 'c'], pa.string()),
          ],
          names=['region', 'Other']
      )
      table_dataset = ds.dataset(table)

      # Projection that renames 'region' to 'Region'.
      columns = {
          'Region': ds.field('region'),
          'Other': ds.field('Other'),
      }
      scanner = table_dataset.scanner(columns=columns)

      # Partition on the projected name 'Region'; this is the call that fails.
      ds.write_dataset(
          scanner,
          'newpath',
          partitioning=['Region'], partitioning_flavor='hive',
          format='parquet')
       

      I get this exception:

      KeyError: 'Column Region does not exist in schema'
       

      I suspect this is because write_dataset isn't looking at the correct schema: it should use scanner.projected_schema rather than scanner.dataset_schema.

      I think it's just a matter of updating this line: https://github.com/apache/arrow/blob/bc6c4988691cf60ecac67542b2daa2ac19fde5d9/python/pyarrow/dataset.py#L967

       

      The issue was raised here: https://stackoverflow.com/questions/73139467/how-to-incorporate-projected-columns-in-scanner-into-new-dataset-partitioning

       


            People

              0x26dres &res
              0x26dres &res
              Votes: 0
              Watchers: 3


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h