Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 0.8.0
- Fix Version/s: None
- Environment: Kernel: 4.14.8-300.fc27.x86_64, Python: 3.6.3
Description
I want to read specific partitions from a partitioned parquet dataset. This is very useful for large datasets. I have attached a small script that creates a dataset and shows what is expected when reading (quoting the salient points below).
- There is no way to read specific partitions in pandas.
- In pyarrow I tried to achieve this by passing a list of files/directories to ParquetDataset, but it didn't work (see the sketch after this list).
- In PySpark it works if I simply do:
spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
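
For reference, the pyarrow attempt looked roughly like this (the paths are hypothetical placeholders for a Hive-style partitioned layout):

import pyarrow.parquet as pq

# Hypothetical layout: datadir/year=.../month=.../part-0.parquet
partitions = [
    'datadir/year=2017/month=11',
    'datadir/year=2017/month=12',
]

# Passing a list of partition directories to ParquetDataset;
# this is the call that did not work for me.
dataset = pq.ParquetDataset(partitions)
df = dataset.read().to_pandas()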
I also couldn't find a way to easily write partitioned parquet files. In the end I did it by hand, creating the directory hierarchy and writing the individual files myself (similar to the implementation in the attached script). Again, in PySpark I can do
df.write.partitionBy(*list_of_partitions).parquet(output)
to achieve that.
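
For illustration, here is a minimal sketch of that hand-rolled approach, assuming a Hive-style key=value directory layout (the write_partitioned helper and the part-0.parquet file name are hypothetical; the attached script does something similar):

import os

import pyarrow as pa
import pyarrow.parquet as pq


def write_partitioned(df, output, partition_cols):
    """Write a pandas DataFrame as a Hive-style partitioned parquet dataset."""
    for keys, group in df.groupby(partition_cols):
        if not isinstance(keys, tuple):  # single partition column
            keys = (keys,)
        # One directory level per partition column: output/col=val/...
        subdir = os.path.join(
            output, *('%s=%s' % (c, v) for c, v in zip(partition_cols, keys)))
        os.makedirs(subdir, exist_ok=True)
        # The partition values are encoded in the path, so drop the columns.
        table = pa.Table.from_pandas(group.drop(columns=partition_cols),
                                     preserve_index=False)
        pq.write_table(table, os.path.join(subdir, 'part-0.parquet'))

# e.g. write_partitioned(df, 'datadir', ['year', 'month'])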