Details
- Bug
- Status: Resolved
- Minor
- Resolution: Fixed
- 1.0.1
- None
Description
Calling Table.from_pandas() on a large DataFrame that has a column of vectors (one np.array per row) raises an `ArrowCapacityError`.
To reproduce:
    import pandas as pd
    import numpy as np
    import pyarrow as pa

    n = 1713614
    df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
    pa.Table.from_pandas(df)
With a smaller n it works.
Error raised:
    ---------------------------------------------------------------------------
    ArrowCapacityError                        Traceback (most recent call last)
    <ipython-input-7-1a7b68a179a0> in <module>
    ----> 1 _ = pa.Table.from_pandas(df)

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
        591         for i, maybe_fut in enumerate(arrays):
        592             if isinstance(maybe_fut, futures.Future):
    --> 593                 arrays[i] = maybe_fut.result()
        594
        595     types = [x.type for x in arrays]

    ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
        423                 raise CancelledError()
        424             elif self._state == FINISHED:
    --> 425                 return self.__get_result()
        426
        427             self._condition.wait(timeout)

    ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
        382     def __get_result(self):
        383         if self._exception:
    --> 384             raise self._exception
        385         else:
        386             return self._result

    ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py in run(self)
         55
         56         try:
    ---> 57             result = self.fn(*self.args, **self.kwargs)
         58         except BaseException as exc:
         59             self.future.set_exception(exc)

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
        557
        558     try:
    --> 559         result = pa.array(col, type=type_, from_pandas=True, safe=safe)
        560     except (pa.ArrowInvalid,
        561             pa.ArrowNotImplementedError,

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

    ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
I guess one needs to chunk the data before creating the arrays?
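A minimal sketch of what such chunking could look like, not a confirmed fix for this issue. The helper names (`rows_per_chunk`, `chunk_bounds`, `table_from_pandas_chunked`) are hypothetical, and the limit is simply the number quoted in the error message. The idea is to convert the DataFrame slice by slice so that no single list array has to hold all the child elements, then concatenate the resulting tables.

```python
# Limit quoted in the error message above (child elements per list array).
MAX_CHILD_ELEMENTS = 2147483646


def rows_per_chunk(elems_per_row, max_child_elements=MAX_CHILD_ELEMENTS):
    """Largest row count whose list values fit under the limit (hypothetical helper)."""
    if elems_per_row <= 0:
        raise ValueError("elems_per_row must be positive")
    return max(1, max_child_elements // elems_per_row)


def chunk_bounds(n_rows, chunk_rows):
    """Return (start, stop) row ranges that together cover range(n_rows)."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    return [(start, min(start + chunk_rows, n_rows))
            for start in range(0, n_rows, chunk_rows)]


def table_from_pandas_chunked(df, chunk_rows):
    """Convert df one row slice at a time and concatenate the tables."""
    # Imported lazily so the two helpers above have no dependencies.
    import pyarrow as pa

    tables = [
        # preserve_index=False keeps per-slice index metadata out of the schema.
        pa.Table.from_pandas(df.iloc[start:stop], preserve_index=False)
        for start, stop in chunk_bounds(len(df), chunk_rows)
    ]
    if not tables:  # empty DataFrame: nothing to concatenate
        return pa.Table.from_pandas(df, preserve_index=False)
    return pa.concat_tables(tables)
```

With the DataFrame from the reproduction, this would be called as `table_from_pandas_chunked(df, rows_per_chunk(128))`, or with a smaller `chunk_rows` to leave headroom. Each column of the result is a chunked array with one chunk per slice. Whether this avoids the error on 1.0.1 has not been checked here.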
Issue Links
- is fixed by ARROW-9992 [C++][Python] Refactor python to arrow conversions based on a reusable conversion API (Resolved)