Apache Arrow / ARROW-15311

[Python][Docs] Opening a partitioned dataset with schema and filter

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • None
    • Components: Documentation, Python

    Description

      Add a note to the docs: if both partitioning and schema are specified when opening a dataset, and the partition field names are not stored in the data files, then the schema must also include the partition field names (for directory or hive partitioning) whenever the dataset will be filtered on those fields.

      Example:

      import numpy as np
      import pyarrow as pa
      import pyarrow.parquet as pq
      import pyarrow.dataset as ds
      
      # Define the data
      table = pa.table({'one': [-1, np.nan, 2.5],
                         'two': ['foo', 'bar', 'baz'],
                         'three': [True, False, True]})
      
      # Write to partitioned dataset
      # The files will include columns "two" and "three"
      pq.write_to_dataset(table, root_path='dataset_name',
                          partition_cols=['one'])
      
      # Reading the partitioned dataset with a schema that omits the partition
      # field "one" raises an error when filtering on "one"
      
      schema = pa.schema([("three", "double")])
      data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
      subset = ds.field("one") == 2.5
      data.to_table(filter=subset)
      
      # Including the partition field "one" in the schema makes the filter work:
      schema = pa.schema([("three", "double"), ("one", "double")])
      data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
      subset = ds.field("one") == 2.5
      data.to_table(filter=subset)
      
      

          People

            Assignee: Unassigned
            Reporter: Alenka Frim (alenkaf)