[ARROW-9976] [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.0.1
Fix Version/s: 2.0.0
Component/s: Python
Labels: None

External issue URL: https://github.com/apache/arrow/issues/26002

Description

Calling Table.from_pandas() on a large DataFrame that has a column of vectors (one np.array per row) raises an `ArrowCapacityError`.

To reproduce:

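import numpy as np
import pandas as pd
import pyarrow as pa

# Column "a" holds one 128-element float64 NumPy vector per row; "b" is a plain integer column.
n = 1713614
df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})

# Raises ArrowCapacityError on pyarrow 1.0.1.
pa.Table.from_pandas(df)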

The conversion succeeds with a smaller n.

Error raised:

---------------------------------------------------------------------------
ArrowCapacityError                        Traceback (most recent call last)
<ipython-input-7-1a7b68a179a0> in <module>
----> 1 _ = pa.Table.from_pandas(df)

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    591         for i, maybe_fut in enumerate(arrays):
    592             if isinstance(maybe_fut, futures.Future):
--> 593                 arrays[i] = maybe_fut.result()
    594 
    595     types = [x.type for x in arrays]

~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    423                 raise CancelledError()
    424             elif self._state == FINISHED:
--> 425                 return self.__get_result()
    426 
    427             self._condition.wait(timeout)

~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py in run(self)
     55 
     56         try:
---> 57             result = self.fn(*self.args, **self.kwargs)
     58         except BaseException as exc:
     59             self.future.set_exception(exc)

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    557 
    558         try:
--> 559             result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    560         except (pa.ArrowInvalid,
    561                 pa.ArrowNotImplementedError,

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
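
As a sanity check on the numbers in that message, the sketch below multiplies out the shape used in the reproduction and compares it with the two figures the error reports. All four constants are copied from this report; the variable names are only illustrative.

```python
# Values copied from the reproduction and from the error message.
n_rows = 1713614            # n in the reproduction
row_width = 128             # length of each vector in column "a"
reported_limit = 2147483646
reported_count = 2147483648

# Number of float values actually stored in column "a".
actual_children = n_rows * row_width

print("child values in the frame:", actual_children)
print("reported count equals 2**31:", reported_count == 2**31)
print("frame is below the reported limit:", actual_children < reported_limit)
```

The flattened column holds roughly a tenth of the reported limit, so the count in the message does not match the number of values in the frame. What produces the larger figure is not stated in this report.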

I guess one needs to chunk the data before creating the arrays?

Issue Links

is fixed by: ARROW-9992 [C++][Python] Refactor python to arrow conversions based on a reusable conversion API (Resolved)

People

Assignee: Krisztian Szucs
Reporter: quentin lhoest
Votes: 0
Watchers: 4

Dates

Created: 11/Sep/20 15:24
Updated: 11/Jan/23 08:10
Resolved: 28/Sep/20 09:22