Apache Arrow / ARROW-9297

[C++][Dataset] Dataset scanner cannot handle large binary column (> 2 GB)


Details

    Description

      Related to ARROW-3762 (the Parquet issue, which has been solved), and discovered in ARROW-9139.

      When creating a Parquet file with a large binary column (larger than the single-chunk BinaryArray capacity of 2 GB):

      # code from the test_parquet.py::test_binary_array_overflow_to_chunked test
      import pyarrow as pa
      import pyarrow.parquet as pq

      values = [b'x'] + [
          b'x' * (1 << 20)
      ] * 2 * (1 << 10)

      table = pa.table({'byte_col': values})
      pq.write_table(table, "test_large_binary.parquet")

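      For context, a quick check (assuming BinaryArray offsets are 32-bit signed integers, capping a single chunk at 2**31 - 1 bytes) shows the test data above lands just past that limit:

```python
# Sketch, not part of the issue: check that the test data above exceeds
# the 32-bit-offset capacity of a single BinaryArray chunk.
chunk_capacity = 2**31 - 1                  # int32 offset limit, ~2 GiB

# One 1-byte value plus 2 * 1024 values of 1 MiB each.
total_bytes = 1 + (1 << 20) * 2 * (1 << 10)

print(total_bytes)                          # 2147483649
print(total_bytes > chunk_capacity)         # True: must be split into chunks
```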
      then reading this with the parquet API works (fixed by ARROW-3762):

      In [3]: pq.read_table("test_large_binary.parquet")
      Out[3]:
      pyarrow.Table
      byte_col: binary
      

      but with the Datasets API this still fails:

      In [1]: import pyarrow.dataset as ds

      In [2]: dataset = ds.dataset("test_large_binary.parquet", format="parquet")

      In [4]: dataset.to_table()
      ---------------------------------------------------------------------------
      ArrowNotImplementedError                  Traceback (most recent call last)
      <ipython-input-4-6fb0d79c4511> in <module>
      ----> 1 dataset.to_table()
      
      ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
      
      ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
      
      ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
      
      ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      ArrowNotImplementedError: This class cannot yet iterate chunked arrays
      
      


          People

            Assignee: Ben Kietzman (bkietz)
            Reporter: Joris Van den Bossche (jorisvandenbossche)
            Votes: 0
            Watchers: 3


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 7h 10m
