Apache Arrow / ARROW-1956

[Python] Support reading specific partitions from a partitioned parquet dataset


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Environment:
      Kernel: 4.14.8-300.fc27.x86_64
      Python: 3.6.3

      Description

      I want to read specific partitions from a partitioned parquet dataset. This is very useful for large datasets. I have attached a small script that creates a dataset and shows what is expected when reading (salient points quoted below).

      1. There is no way to read specific partitions in Pandas.
      2. In pyarrow I tried to achieve the goal by providing a list of files/directories to ParquetDataset, but it didn't work (see the sketch after this list).
      3. In PySpark it works if I simply do:
        spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
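
      For reference, a minimal sketch of the pyarrow attempt from point 2. The directory layout and partition names here are hypothetical; this is the shape of the call that did not behave as hoped:

        import pyarrow.parquet as pq

        # Hypothetical Hive-style layout:
        #   datadir/year=2017/month=11/part-0.parquet
        #   datadir/year=2017/month=12/part-0.parquet
        list_of_partitions = [
            'datadir/year=2017/month=11',
            'datadir/year=2017/month=12',
        ]

        # In pyarrow 0.8.0, passing the partition directories directly
        # does not read them back with the partition columns restored,
        # the way spark.read with a basePath does.
        dataset = pq.ParquetDataset(list_of_partitions)
        df = dataset.read().to_pandas()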

      I also couldn't find a way to easily write partitioned parquet files. In the end I did it by hand, creating the directory hierarchy and writing the individual files myself (similar to the implementation in the attached script). Again, in PySpark I can do

      df.write.partitionBy(*list_of_partitions).parquet(output)
      

      to achieve that.
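
      A minimal sketch of the manual workaround (column names, values, and paths are hypothetical), writing one Hive-style key=value directory per partition with pyarrow:

        import os
        import pandas as pd
        import pyarrow as pa
        import pyarrow.parquet as pq

        df = pd.DataFrame({'year': [2017, 2017, 2018],
                           'month': [11, 12, 1],
                           'value': [1.0, 2.0, 3.0]})

        # One key=value directory per partition, one file per group.
        for (year, month), group in df.groupby(['year', 'month']):
            partition_dir = os.path.join(
                'datadir', 'year=%d' % year, 'month=%d' % month)
            os.makedirs(partition_dir, exist_ok=True)
            # The partition columns are encoded in the path, so drop
            # them from the file contents, as Spark does.
            table = pa.Table.from_pandas(group.drop(['year', 'month'], axis=1))
            pq.write_table(table, os.path.join(partition_dir, 'part-0.parquet'))

      pyarrow.parquet.write_to_dataset(table, root_path, partition_cols=[...]) looks like it should automate this, if it is available in the installed version.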

        Attachments

        1. so-example.py (2 kB) - Suvayu Ali


            People

            • Assignee: Unassigned
            • Reporter: Suvayu Ali
            • Votes: 3
            • Watchers: 10
