Apache Arrow / ARROW-9976

[Python] ArrowCapacityError when doing Table.from_pandas with large dataframe

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.1
    • Fix Version/s: 2.0.0
    • Component/s: Python
    • Labels: None

    Description

      Calling Table.from_pandas() on a large DataFrame that has a column of vectors (np.array) raises an `ArrowCapacityError`.

      To reproduce:

      import pandas as pd
      import numpy as np
      import pyarrow as pa
      
      n = 1713614
      df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
      pa.Table.from_pandas(df)
      

      With a smaller n it works.

      Error raised:

      ---------------------------------------------------------------------------
      ArrowCapacityError                        Traceback (most recent call last)
      <ipython-input-7-1a7b68a179a0> in <module>
      ----> 1 _ = pa.Table.from_pandas(df)
      
      ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
      
      ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
          591         for i, maybe_fut in enumerate(arrays):
          592             if isinstance(maybe_fut, futures.Future):
      --> 593                 arrays[i] = maybe_fut.result()
          594 
          595     types = [x.type for x in arrays]
      
      ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
          423                 raise CancelledError()
          424             elif self._state == FINISHED:
      --> 425                 return self.__get_result()
          426 
          427             self._condition.wait(timeout)
      
      ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
          382     def __get_result(self):
          383         if self._exception:
      --> 384             raise self._exception
          385         else:
          386             return self._result
      
      ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py in run(self)
           55 
           56         try:
      ---> 57             result = self.fn(*self.args, **self.kwargs)
           58         except BaseException as exc:
           59             self.future.set_exception(exc)
      
      ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
          557 
          558         try:
      --> 559             result = pa.array(col, type=type_, from_pandas=True, safe=safe)
          560         except (pa.ArrowInvalid,
          561                 pa.ArrowNotImplementedError,
      
      ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
      
      ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
      
      ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
      

      I guess one needs to chunk the data before creating the arrays?

            People

              Assignee: kszucs Krisztian Szucs
              Reporter: lhoestq quentin lhoest