Apache Arrow / ARROW-13150

[Python] combine_chunks fails on column of table, but does not error on table itself

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

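      combine_chunks() fails on a single column of a table (a ChunkedArray) with "ArrowInvalid: offset overflow while concatenating arrays". Calling it on the table itself does not raise an error, but the resulting column still has 3 chunks instead of 1.

      Is there a reason why the two cases are not handled the same way?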

      In [90]: pa.__version__
      Out[90]: '4.0.0'
      
      # Get shape
      In [85]: pa_table.shape
      Out[85]: (102753589, 1)
      
      In [86]: pa_col1_array = pa_table.column(0)
      
      # Get number of chunks
      In [87]: pa_col1_array.num_chunks
      Out[87]: 4404
      
      # Combining chunks on the pyarrow table with one column works.
      In [88]: pa_table.combine_chunks()
      Out[88]: 
      pyarrow.Table
      # id=TEW__014e25__c14e1d__Multiome_RNA_brain_10x_no_perm: string
      
      # Combining chunks on the column itself does not work.
      In [89]: pa_col1_array.combine_chunks()
      ---------------------------------------------------------------------------
      ArrowInvalid                              Traceback (most recent call last)
      <ipython-input-89-fdd0d0056a8e> in <module>
      ----> 1 pa_col1_array.combine_chunks()
      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.combine_chunks()
      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()
      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
      ArrowInvalid: offset overflow while concatenating arrays
      
      # Assign the combined table to a new variable.
      In [91]: pa_table_combined = pa_table.combine_chunks()
      
      # Get first column
      In [92]: pa_col1_array_from_pa_table_combined = pa_table_combined.column(0)
      
      # Get number of chunks
      In [93]: pa_col1_array_from_pa_table_combined.num_chunks
      Out[93]: 3
      
      # Try to combine the first column again.
      In [94]: pa_col1_array_from_pa_table_combined.combine_chunks()
      ---------------------------------------------------------------------------
      ArrowInvalid                              Traceback (most recent call last)
      <ipython-input-94-e2e323e6519f> in <module>
      ----> 1 pa_col1_array_from_pa_table_combined.combine_chunks()
      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.combine_chunks()
      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()
      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
      ArrowInvalid: offset overflow while concatenating arrays
      
      # Get sizes of each chunk.
      In [106]: [chunk.nbytes for chunk in pa_col1_array_from_pa_table_combined.chunks]
      Out[106]: [2341650593, 2342925682, 241257842]
      

People

    • Assignee: Unassigned
    • Reporter: Gert Hulselmans (ghuls)