Apache Arrow / ARROW-4470

[Python] PyArrow using considerably more memory when reading partitioned Parquet file


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.12.0
    • Fix Version/s: 1.0.0
    • Component/s: Python

      Description

      Hi,

      I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, with the following structure:

      /data/myparquettable/year=2016

      /data/myparquettable/year=2016/myfile_1.prt

      /data/myparquettable/year=2016/myfile_2.prt

      /data/myparquettable/year=2016/myfile_3.prt

      /data/myparquettable/year=2017

      /data/myparquettable/year=2017/myfile_1.prt

      /data/myparquettable/year=2017/myfile_2.prt

      /data/myparquettable/year=2017/myfile_3.prt

      and so on. I need to work with one partition, so I copied one partition to a local filesystem:

      hdfs dfs -get /data/myparquettable/year=2017 /local/
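      As an aside, PyArrow can also select a single Hive-style partition without copying the files first, by pushing a filter on the partition key that the directory names (year=2016, year=2017, ...) encode. A minimal sketch, assuming a recent PyArrow and using a tiny throwaway dataset in place of the real table root:

      ```python
      import tempfile

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Build a tiny Hive-partitioned dataset as a stand-in for
      # /data/myparquettable; the year=... directory names encode
      # the partition key.
      root = tempfile.mkdtemp()
      table = pa.table({'year': [2016, 2016, 2017], 'value': [1.0, 2.0, 3.0]})
      pq.write_to_dataset(table, root, partition_cols=['year'])

      # Read only the year=2017 partition: the filter is applied at the
      # directory level, so files under year=2016 are never opened.
      part = pq.read_table(root, filters=[('year', '=', 2017)])
      print(part.num_rows)  # -> 1
      ```

      Against the real table this avoids the hdfs copy entirely, at the cost of requiring a PyArrow version that supports `filters`.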

      so now I have some data on the local disk:

      /local/year=2017/myfile_1.prt

      /local/year=2017/myfile_2.prt

      etc. I tried to read it using PyArrow:

      import pyarrow.parquet as pq

      pq.read_table('/local/year=2017')

      and it starts reading. The problem is that the local Parquet files are around 15 GB in total, and I blew up my machine's memory a couple of times: while reading these files, PyArrow uses more than 60 GB of RAM, and I'm not sure how much it will ultimately take because it never finishes. Is this expected? Is there a workaround?
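      One workaround (a sketch, not the project's official answer) is to stream the partition instead of materializing it all at once: open each file with `pq.ParquetFile` and read one row group at a time, so peak memory is bounded by the largest row group rather than the whole 15 GB. The helper name and the `process()` step below are illustrative:

      ```python
      import glob

      import pyarrow.parquet as pq

      def iter_row_groups(path_glob):
          """Yield one pyarrow.Table per row group across the matched files."""
          for path in sorted(glob.glob(path_glob)):
              pf = pq.ParquetFile(path)
              for i in range(pf.num_row_groups):
                  yield pf.read_row_group(i)

      # Hypothetical usage against the copied partition:
      # for chunk in iter_row_groups('/local/year=2017/*.prt'):
      #     process(chunk)  # process() is a placeholder for your own logic
      ```

      Each yielded chunk can be dropped after processing, so memory use stays roughly flat regardless of the total partition size.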

       

            People

            • Assignee: Unassigned
            • Reporter: Ivan SPM (ispmarin)
            • Votes: 0
            • Watchers: 4
