[ARROW-6380] Method pyarrow.parquet.read_table has memory spikes from version 0.14 - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.14.0, 0.14.1
Fix Version/s: None
Component/s: C++
Labels: None
Environment: Ubuntu 18, 16 GB RAM, 4 CPUs
External issue URL: https://github.com/apache/arrow/issues/22753

Description

Method pyarrow.parquet.read_table is very slow and causes RAM spikes starting with version 0.14.0.

Reading a 40 MB Parquet file takes less than 1 second in versions 0.11, 0.12 and 0.13, whereas it takes 6 to 30 seconds in versions 0.14.x.

The slowdown is easy to measure. However, there is another problem that I could only detect by watching htop: while opening a 40 MB Parquet file, the process occupies almost 16 GB for a few milliseconds. The resulting pyarrow table takes around 300 MB in the Python process (measured with memory-profiler). This does not happen in version 0.13 or earlier.
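A rough way to capture both symptoms without watching htop is sketched below. It is not taken from the report: the helper names `measure_call` and `report_read_table` are hypothetical, and it assumes a Unix system, where `resource.getrusage` exposes the process's peak resident set size (in kilobytes on Linux). Because the linked ARROW-6060 title mentions `use_threads=True`, the sketch exposes that flag so both settings can be compared.

```python
import resource
import time


def measure_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_rss_growth).

    peak_rss_growth is the increase in the process-wide peak RSS
    (ru_maxrss: kilobytes on Linux, bytes on macOS). It is a high-water
    mark, so it also registers spikes that last only a few milliseconds.
    """
    peak_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    peak_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_after - peak_before


def report_read_table(path, use_threads=True):
    """Hypothetical helper: time one pyarrow.parquet.read_table call."""
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow

    table, elapsed, growth = measure_call(
        pq.read_table, path, use_threads=use_threads
    )
    print(
        f"use_threads={use_threads}: {elapsed:.2f} s, "
        f"peak RSS grew by {growth} (ru_maxrss units), "
        f"{table.num_rows} rows"
    )
    return elapsed, growth
```

Since the peak only ever increases, each configuration (pyarrow version, `use_threads` setting) should be measured in a fresh Python process, for example by calling `report_read_table("file.parquet", use_threads=False)` from a separate script run; the file name here is a placeholder.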

Issue Links

duplicates

ARROW-6060 [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True

Resolved

People

Assignee: Unassigned

Reporter: Renan Alves Fonseca

Votes: 0

Watchers: 4

Dates

Created: 28/Aug/19 19:06

Updated: 11/Jan/23 07:46

Resolved: 30/Aug/19 14:02
