[ARROW-5791] [Python] pyarrow.csv.read_csv hangs + eats all RAM - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.13.0
Fix Version/s: 0.14.0, 0.14.1
Component/s: Python
Labels:
- pull-request-available
Environment:
Ubuntu Xenial, python 2.7

External issue URL:
https://github.com/apache/arrow/issues/22212

Description

I have a fairly sparse dataset in CSV format: a wide table that has several rows but many (32k) columns. Total size is ~540 kB.

When I read the dataset using `pyarrow.csv.read_csv`, it hangs, gradually consumes all available memory, and gets killed.

More details on the conditions follow. The script to run and all files mentioned are attached.

1) `sample_32769_cols.csv` is the dataset that exhibits the problem.

2) `sample_32768_cols.csv` is the dataset that DOES NOT exhibit the problem and is read in under 400 ms on my machine. It is the same dataset without the ONE last column. That last column is no different from the others and contains only empty values.

I have no idea why exactly this one column makes the difference between proper execution and a hang that looks like a memory leak.

I have created a flame graph for case (1) to help resolve this issue (`graph.svg`).
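The two cases above can be approximated without the attachments. The following is a minimal sketch, not the attached `csvtest.py` (whose contents are not shown here): it assumes Python 3, and the helper names `write_wide_sparse_csv`, `read_with_pyarrow`, and `compare_column_counts`, the column names, and the fill pattern are all hypothetical. It writes a few-row CSV where most cells are empty, including the last column, and reads it back with `pyarrow.csv.read_csv`.

```python
import csv
import os
import tempfile


def write_wide_sparse_csv(path, n_cols, n_rows=3, fill_every=1000):
    """Write a CSV with n_cols columns in which most cells are empty.

    Hypothetical stand-in for the attached sample files: only every
    `fill_every`-th column gets a value, all other cells are empty.
    """
    header = ["c{}".format(i) for i in range(n_cols)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for r in range(n_rows):
            row = [str(r) if i % fill_every == 0 else "" for i in range(n_cols)]
            writer.writerow(row)
    return path


def read_with_pyarrow(path):
    """Read the file with pyarrow.csv.read_csv, the call named in the report."""
    # Imported lazily so the generator above works without pyarrow installed.
    from pyarrow import csv as pacsv
    return pacsv.read_csv(path)


def compare_column_counts(column_counts=(32768, 32769)):
    """Generate one file per column count and read each one back."""
    tmpdir = tempfile.mkdtemp()
    results = {}
    for n_cols in column_counts:
        name = "sample_{}_cols.csv".format(n_cols)
        path = write_wide_sparse_csv(os.path.join(tmpdir, name), n_cols)
        results[n_cols] = read_with_pyarrow(path).num_columns
    return results
```

Calling `compare_column_counts()` reads both widths in turn. Whether a synthetic file triggers the same behaviour as the attached one is an assumption; on an affected version the 32769-column read may exhaust memory, so it is safer to run it under a memory limit or in a disposable process.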

Attachments


- csvtest.py, 0.1 kB, 29/Jun/19 23:18, Bogdan Klichuk
- graph.svg, 67 kB, 29/Jun/19 23:19, Bogdan Klichuk
- sample_32768_cols.csv, 537 kB, 29/Jun/19 23:19, Bogdan Klichuk
- sample_32769_cols.csv, 537 kB, 29/Jun/19 23:19, Bogdan Klichuk

Issue Links

links to

GitHub Pull Request #4762

People

Assignee: Micah Kornfield

Reporter: Bogdan Klichuk

Votes: 0

Watchers: 6

Dates

Created: 29/Jun/19 23:29

Updated: 11/Jan/23 07:42

Resolved: 01/Jul/19 18:43

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2.5h