  Apache Arrow / ARROW-9143

[C++] RecordBatch::Slice erroneously sets non-nullable field's internal null_count to unknown


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.1
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Environment: linux, pyarrow 0.17.1 installed with pipenv

    Description

      A segfault is triggered by calling dictionary_encode on a column after slicing a RecordBatch:

      import pyarrow as pa
      print(pa.__version__)
      
      array = pa.array(['foo', 'bar', 'baz'])
      batch = pa.RecordBatch.from_arrays([array], names=['a'])
      
      batch.column(0).dictionary_encode() ### works fine
      
      sub_batch = batch.slice(1)
      sub_batch.column(0).dictionary_encode() ### segfault
      

      Slicing the underlying array and then calling dictionary_encode works as expected:

      array.slice(1).dictionary_encode()
      

      For what it's worth, this can be worked around by converting the sub_batch to a Table and back:

      happy_sub_batch = pa.Table.from_batches([sub_batch]).to_batches()[0]
      happy_sub_batch.column(0).dictionary_encode() ### works fine
      
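      An alternative workaround, sketched below as a minimal, untested variant of the array.slice(1) observation above (variable names are illustrative), is to slice each column array directly and rebuild the RecordBatch from the sliced arrays:

      # Untested sketch, continuing from the snippet above: slice each column
      # array directly (which works per the array.slice(1) observation) and
      # rebuild the RecordBatch from the sliced arrays.
      sliced_columns = [batch.column(i).slice(1) for i in range(batch.num_columns)]
      rebuilt_sub_batch = pa.RecordBatch.from_arrays(sliced_columns, names=batch.schema.names)
      rebuilt_sub_batch.column(0).dictionary_encode()  # expected to work like array.slice(1).dictionary_encode()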


            People

              Assignee: Wes McKinney (wesm)
              Reporter: Benedict Hutchings (bmh)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m