  1. Apache Arrow
  2. ARROW-9297

[C++][Dataset] Dataset scanner cannot handle large binary column (> 2 GB)

Details

    Description

      Related to ARROW-3762 (the Parquet issue, which has since been resolved) and discovered in ARROW-9139.

      When creating a Parquet file with a large binary column (larger than BinaryArray capacity):

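      # code from the test_parquet.py::test_binary_array_overflow_to_chunked test
      import pyarrow as pa
      import pyarrow.parquet as pq

      # one 1-byte value followed by 2048 values of 1 MiB each (just over 2 GiB in total)
      values = [b'x'] + [
          b'x' * (1 << 20)
      ] * 2 * (1 << 10)

      table = pa.table({'byte_col': values})
      pq.write_table(table, "test_large_binary.parquet")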
      

      then reading this with the parquet API works (fixed by ARROW-3762):

      In [3]: pq.read_table("test_large_binary.parquet")
      Out[3]: 
      pyarrow.Table
      byte_col: binary
      

      but with the Datasets API this still fails:

      In [1]: import pyarrow.dataset as ds

      In [2]: dataset = ds.dataset("test_large_binary.parquet", format="parquet")

      In [4]: dataset.to_table()
      ---------------------------------------------------------------------------
      ArrowNotImplementedError                  Traceback (most recent call last)
      <ipython-input-4-6fb0d79c4511> in <module>
      ----> 1 dataset.to_table()
      
      ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
      
      ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
      
      ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
      
      ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      ArrowNotImplementedError: This class cannot yet iterate chunked arrays
      
      
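The size arithmetic behind the reproducer can be checked without allocating any large buffers. The sketch below is only an illustration: it assumes the limit in question is the largest value a signed 32-bit offset can hold (BinaryArray stores int32 offsets), and `split_into_chunks` is a hypothetical greedy helper, not the splitting logic Arrow or the Parquet reader actually uses.

```python
# Work with value lengths only, so nothing close to 2 GB is allocated.
INT32_MAX = 2**31 - 1  # largest offset a BinaryArray (int32 offsets) can store

# Same shape as the reproducer: one 1-byte value, then 2048 values of 1 MiB.
lengths = [1] + [1 << 20] * (2 * (1 << 10))
total_bytes = sum(lengths)


def split_into_chunks(lengths, limit=INT32_MAX):
    """Hypothetical greedy split: start a new chunk whenever adding the
    next value would push the chunk's data past `limit` bytes."""
    chunks, current, used = [], [], 0
    for n in lengths:
        if current and used + n > limit:
            chunks.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        chunks.append(current)
    return chunks


chunks = split_into_chunks(lengths)
print(total_bytes, total_bytes > INT32_MAX, [len(c) for c in chunks])
```

Under these assumptions the column holds 2 bytes more than a single chunk can address, so it cannot be returned as one contiguous binary array and has to be split into at least two chunks. That is the situation the scanner rejects with "This class cannot yet iterate chunked arrays".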

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 7h 10m