
ARROW-5791: [Python] pyarrow.csv.read_csv hangs + eats all RAM

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 0.13.0
    • Fix Versions: 0.14.0, 0.14.1
    • Component: Python
    • Environment: Ubuntu Xenial, python 2.7

    Description

      I have quite a sparse dataset in CSV format: a wide table with several rows but many (32k) columns. Total size is ~540 kB.

      When I read the dataset using `pyarrow.csv.read_csv` it hangs, gradually eats all memory and gets killed.

      More details on the conditions follow. The script to run and all mentioned files are under attachments.

      1) `sample_32769_cols.csv` is the dataset that suffers from the problem.

      2) `sample_32768_cols.csv` is the dataset that DOES NOT suffer from it and is read in under 400 ms on my machine. It's the same dataset without the ONE last column. That last column is no different from the others and contains only empty values.

      Why exactly this one column makes the difference between proper execution and a hanging failure that looks like a memory leak, I have no idea.

      I have created a flame graph for case (1) to support resolving this issue (`graph.svg`).
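
      For reference, a minimal sketch of the reproduction (the attached csvtest.py is not shown inline, and the generator below only approximates the attached files from the description: a header row plus a few rows of entirely empty values):

      import pyarrow.csv

      def make_wide_csv(path, num_cols, num_rows=3):
          # Hypothetical generator approximating the attached samples:
          # a header of num_cols column names, then sparse (all-empty) rows.
          with open(path, 'w') as f:
              f.write(','.join('c%d' % i for i in range(num_cols)) + '\n')
              for _ in range(num_rows):
                  f.write(',' * (num_cols - 1) + '\n')

      make_wide_csv('sample_32768_cols.csv', 32768)  # reads in well under a second
      make_wide_csv('sample_32769_cols.csv', 32769)  # hangs and eats all RAM on 0.13.0

      table = pyarrow.csv.read_csv('sample_32768_cols.csv')
      print(table.num_rows, table.num_columns)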

       

      Attachments

        1. sample_32768_cols.csv (537 kB, Bogdan Klichuk)
        2. sample_32769_cols.csv (537 kB, Bogdan Klichuk)
        3. graph.svg (67 kB, Bogdan Klichuk)
        4. csvtest.py (0.1 kB, Bogdan Klichuk)


          Activity

            rokm Rok Mihevc added a comment -

            This issue has been migrated to issue #22212 on GitHub. Please see the migration documentation for further details.

            klichukb Bogdan Klichuk added a comment -

            Thanks a lot! 

            wesm Wes McKinney added a comment -

            Issue resolved by pull request 4762
            https://github.com/apache/arrow/pull/4762

            emkornfield@gmail.com Micah Kornfield added a comment -

            Made a PR; this wasn't really an overflow. Also added a fixed cap at 1000*1024 columns, which should be enough for anyone.
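
            For illustration, a sketch of the kind of guard described above (hypothetical; this is not Arrow's actual implementation, only the constant comes from the comment):

            # Hypothetical sketch, not Arrow's real code: fail fast with a clear
            # error on very wide files instead of misbehaving.
            MAX_CSV_COLUMNS = 1000 * 1024

            def check_num_columns(num_cols):
                if num_cols > MAX_CSV_COLUMNS:
                    raise ValueError(
                        "CSV has %d columns; the maximum supported is %d"
                        % (num_cols, MAX_CSV_COLUMNS))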
            wesm Wes McKinney added a comment -

            I think we should support more columns

            emkornfield@gmail.com Micah Kornfield added a comment -

            I think I know where this is occurring, will try to patch tonight.

            Do we want to support more columns or throw an error?
            wesm Wes McKinney added a comment -

            Evidently there's an int16 overflow somewhere in the Arrow CSV codebase. An initial grep didn't turn up anything obvious.

            Comparisons with other CSV libraries (like pandas) probably are not relevant since there is no code overlap.

            klichukb Bogdan Klichuk added a comment -

            Just to note: I can successfully convert a dataframe (if I read it using pandas) to a pyarrow.Table directly.

            import pandas
            import pyarrow

            df = pandas.read_csv('...', ...)
            table = pyarrow.Table.from_pandas(df)
            klichukb Bogdan Klichuk added a comment -

            bhulette It's a shame I threw away the idea of "maybe this is just a power of 2" and didn't simply try. Great point.

            bhulette Brian Hulette added a comment -

            Thanks for the concise bug report! I haven't had a chance to dig into this very far, but I'm sure it's not a coincidence that 32768 == 2^15. 32767 is the max of a signed 16-bit integer, so if we're assigning a signed int16 index to each column somewhere, it would overflow once you get beyond 32768 columns (since one column gets index 0).

            I'm not sure where exactly that would be happening though. My first inclination was that it would be in the element count for the vector of fields, but according to the flatbuffers page vectors are prefixed by a 32-bit element count.
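
            For example, the suspected wrap-around is easy to see with a 16-bit integer type (illustrative only, using ctypes to emulate the column index):

            import ctypes

            # A signed int16 holds column indices 0..32767 (32768 columns in all);
            # the 32769th column (index 32768) wraps around to a negative value.
            print(ctypes.c_int16(32767).value)  # 32767
            print(ctypes.c_int16(32768).value)  # -32768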


            People

              Assignee: emkornfield@gmail.com Micah Kornfield
              Reporter: klichukb Bogdan Klichuk
              Votes: 0
              Watchers: 6


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2.5h