Apache Arrow / ARROW-1956

[Python] Support reading specific partitions from a partitioned parquet dataset


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Environment:
      Kernel: 4.14.8-300.fc27.x86_64
      Python: 3.6.3

      Description

      I want to read specific partitions from a partitioned parquet dataset. This is very useful for large datasets. I have attached a small script that creates a dataset and shows what is expected when reading (salient points quoted below).

      1. There is no way to read specific partitions in Pandas.
      2. In pyarrow I tried to achieve the goal by providing a list of files/directories to ParquetDataset, but it didn't work (see the sketch after this list).
      3. In PySpark it works if I simply do:
        spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
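
      For reference, a minimal sketch of the pyarrow attempt from point 2. The directory layout and partition names here are hypothetical; this is the shape of the call that did not behave as hoped:

        import pyarrow.parquet as pq

        # Hypothetical Hive-style layout:
        #   datadir/year=2017/month=11/part-0.parquet
        #   datadir/year=2017/month=12/part-0.parquet
        list_of_partitions = [
            'datadir/year=2017/month=11',
            'datadir/year=2017/month=12',
        ]

        # In pyarrow 0.8.0, passing the partition directories directly
        # does not read them back with the partition columns restored,
        # the way spark.read with a basePath does.
        dataset = pq.ParquetDataset(list_of_partitions)
        df = dataset.read().to_pandas()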

      I also couldn't find a way to easily write partitioned parquet files. In the end I did it by hand, creating the directory hierarchy and writing the individual files myself (similar to the implementation in the attached script). Again, in PySpark I can do

      df.write.partitionBy(*list_of_partitions).parquet(output)
      

      to achieve that.
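
      A minimal sketch of the manual workaround (column names, values, and paths are hypothetical), writing one Hive-style key=value directory per partition with pyarrow:

        import os
        import pandas as pd
        import pyarrow as pa
        import pyarrow.parquet as pq

        df = pd.DataFrame({'year': [2017, 2017, 2018],
                           'month': [11, 12, 1],
                           'value': [1.0, 2.0, 3.0]})

        # One key=value directory per partition, one file per group.
        for (year, month), group in df.groupby(['year', 'month']):
            partition_dir = os.path.join(
                'datadir', 'year=%d' % year, 'month=%d' % month)
            os.makedirs(partition_dir, exist_ok=True)
            # The partition columns are encoded in the path, so drop
            # them from the file contents, as Spark does.
            table = pa.Table.from_pandas(group.drop(['year', 'month'], axis=1))
            pq.write_table(table, os.path.join(partition_dir, 'part-0.parquet'))

      pyarrow.parquet.write_to_dataset(table, root_path, partition_cols=[...]) looks like it should automate this, if it is available in the installed version.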

        Attachments

        1. so-example.py (2 kB) - Suvayu Ali


            People

            • Assignee: Unassigned
            • Reporter: Suvayu Ali
            • Votes: 3
            • Watchers: 10
