Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 0.10.0
- Fix Version/s: None
- Environment: pandas=0.23.1=py36h637b7d7_0, pyarrow==0.10.0
Description
My assumption: the problem is caused by a large object column containing strings up to 27 characters long, so the total string data in that column is much larger than 2GB (i.e. a chunking issue).
Code (a consolidated, runnable version is sketched below)
- basket_plateau = pq.read_table("basket_plateau.parquet")
- basket_plateau = pd.read_parquet("basket_plateau.parquet")
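A consolidated, runnable form of the two calls above (imports added; the pandas call assumes the default pyarrow engine is picked up):

    import pandas as pd
    import pyarrow.parquet as pq

    # Either call triggers the error below; pd.read_parquet delegates to pyarrow here.
    basket_plateau = pq.read_table("basket_plateau.parquet")
    basket_plateau = pd.read_parquet("basket_plateau.parquet")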
Error produced
- ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483655
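A possible workaround sketch, not verified against this dataset: read the file one row group at a time so that no single read has to materialize the whole string column at once. This assumes the file holds multiple row groups and that no single row group's object column exceeds the 2 GiB limit on its own.

    import pyarrow
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("basket_plateau.parquet")
    # Read row groups separately, then concatenate; concat_tables keeps the
    # per-row-group chunks, so no single BinaryArray holds all the strings.
    pieces = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    basket_plateau = pyarrow.concat_tables(pieces)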
Dataset
- Pandas dataframe (pandas=0.23.1=py36h637b7d7_0)
- 2.7 billion records, 4 columns (int64/object/datetime64/float64)
- approx. 90 GB in memory
- example values in the object column: "Fresh Vegetables", "Alcohol Beers", ... (think food retail categories; a rough size estimate is sketched below)
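A rough back-of-the-envelope check of the assumption in the description (the average string length is a guess based on the example values, not measured):

    n_rows = 2_700_000_000           # 2.7 billion records
    avg_len = 15                     # guessed average bytes per string ("Fresh Vegetables" is 16)
    total_bytes = n_rows * avg_len   # ~40.5 billion bytes, roughly 38 GiB of raw string data
    print(total_bytes > 2**31 - 1)   # True: far beyond the 2147483646-byte BinaryArray limit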
History leading to the bug:
- was using an older version of pyarrow
- tried writing the dataset to disk (Parquet) and failed
- stumbled on https://issues.apache.org/jira/browse/ARROW-2227
- upgraded to 0.10
- tried writing the dataset to disk (Parquet) and succeeded
- tried reading the dataset back and failed
- looks like a case similar to: https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
Issue Links
- duplicates: ARROW-3762 [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray (Resolved)