  1. Apache Arrow
  2. ARROW-6380

Method pyarrow.parquet.read_table has memory spikes from version 0.14

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.14.0, 0.14.1
    • Fix Version/s: None
    • Component/s: C++
    • Labels: None
    • Environment: Ubuntu 18, 16GB RAM, 4 CPUs

    Description

      The method pyarrow.parquet.read_table is very slow and causes RAM spikes starting with version 0.14.0.

      Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.

      This performance regression is easy to measure. However, there is another problem that I could only detect on the htop screen: while opening a 40MB parquet file, the process occupies almost 16GB for a few milliseconds. The resulting pyarrow table takes around 300MB in the Python process (measured with memory-profiler). This does not happen in version 0.13 or earlier.
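
      The slowdown and the short-lived memory spike could be checked with something like the sketch below. It is not the script used for this report: the helper names measure and measure_read_table and the file path passed in are hypothetical. It relies on the process's peak resident set size (ru_maxrss), which is a high-water mark and so should also register spikes too brief to catch in htop. The value is in kilobytes on Linux, and it only grows when the spike exceeds any earlier peak of the same process, so it is best run in a fresh interpreter.

```python
import resource
import time


def measure(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed_seconds, peak_rss_growth)."""
    # ru_maxrss is the lifetime high-water mark of resident memory for this
    # process (kilobytes on Linux, bytes on macOS).
    peak_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    peak_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_after - peak_before


def measure_read_table(path):
    """Time pyarrow.parquet.read_table on a parquet file (path is hypothetical)."""
    # Imported lazily so that measure() itself does not require pyarrow.
    import pyarrow.parquet as pq

    table, elapsed, peak_growth = measure(pq.read_table, path)
    print(f"rows={table.num_rows} time={elapsed:.2f}s "
          f"peak RSS growth={peak_growth} (ru_maxrss units)")
    return table
```

      Running measure_read_table on the same file under 0.13 and under 0.14.x, each in a new process, would give directly comparable time and peak-memory figures.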

            People

              Assignee: Unassigned
              Reporter: Renan Alves Fonseca (rafonseca)