[ARROW-6059] [Python] Regression memory issue when calling pandas.read_parquet - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.14.0, 0.14.1
Fix Version/s: 0.15.0
Component/s: Python
Labels: None

External issue URL: https://github.com/apache/arrow/issues/22461

Description

I have a ~3MB parquet file with the following schema:

bag_stamp: timestamp[ns]
transforms_[]_.header.seq: list<item: int64>
  child 0, item: int64
transforms_[]_.header.stamp: list<item: timestamp[ns]>
  child 0, item: timestamp[ns]
transforms_[]_.header.frame_id: list<item: string>
  child 0, item: string
transforms_[]_.child_frame_id: list<item: string>
  child 0, item: string
transforms_[]_.transform.translation.x: list<item: double>
  child 0, item: double
transforms_[]_.transform.translation.y: list<item: double>
  child 0, item: double
transforms_[]_.transform.translation.z: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.x: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.y: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.z: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.w: list<item: double>
  child 0, item: double

If I read it with pandas.read_parquet() using pyarrow 0.13.0, everything works and the file loads almost instantly. With 0.14.0 or 0.14.1, the same call takes a long time and uses ~10GB of RAM; when there is not enough memory available, the process is often killed by the OOM killer. However, if I use the following code snippet instead, it works perfectly with all versions:

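import pyarrow as pa
import pyarrow.parquet as pq

# input_file: path to the parquet file; columns: column names to read (None = all)
parquet_file = pq.ParquetFile(input_file)
tables = []
for row_group in range(parquet_file.num_row_groups):  # read one row group at a time
    tables.append(parquet_file.read_row_group(row_group, columns=columns, use_pandas_metadata=True))
df = pa.concat_tables(tables).to_pandas()  # concatenate, then convert to pandas once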

Attachments

Memory_profile_0.14.1_use_thread_true.png (104 kB, added 30/Jul/19 15:50 by Olivier Giboin)
Memory_profile_0.14.1_use_thread_FALSE.png (98 kB, added 30/Jul/19 15:50 by Olivier Giboin)
Memory_profile_0.14.1_use_thread_false_rs.png (35 kB, added 30/Jul/19 15:58 by Olivier Giboin)
Memory_profile_0.13.png (105 kB, added 30/Jul/19 15:50 by Olivier Giboin)
Memory_profile_0.13_rs.png (37 kB, added 30/Jul/19 15:54 by Olivier Giboin)

Issue Links

is related to

ARROW-6060 [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True

Resolved

People

Assignee: Unassigned

Reporter: Francisco Sanchez

Votes: 1

Watchers: 5

Dates

Created: 29/Jul/19 10:21

Updated: 11/Jan/23 07:44

Resolved: 07/Aug/19 18:35