Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 0.10.0
Description
Discovered this while using pyarrow to work with RecordBatch streams and Parquet. The issue can be reproduced as follows:
import pyarrow as pa
import pyarrow.parquet as pq

# create record batch with 1 dictionary column
indices = pa.array([1, 0, 1, 1, 0])
dictionary = pa.array(['Foo', 'Bar'])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
rb = pa.RecordBatch.from_arrays(
    [ dict_array ],
    [ 'd0' ]
)

# write out using RecordBatchStreamWriter
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, rb.schema)
writer.write_batch(rb)
writer.close()
buf = sink.get_result()

# read in and try to write parquet table
reader = pa.open_stream(buf)
tbl = reader.read_all()
pq.write_table(tbl, 'dict_table.parquet')  # SEGFAULTS
When writing record batch streams, if an array has no nulls, Arrow writes a placeholder nullptr instead of the full bitmap of 1s. When that stream is deserialized, the null bitmap is therefore never populated and is left as a nullptr. When the resulting table is then written via pyarrow.parquet, execution reaches the parquet writer code which attempts to Cast the dictionary to a non-dictionary representation. Because the null count is not checked before creating a BitmapReader, the BitmapReader is constructed with a nullptr for bitmap_data but a non-zero length, and it segfaults in its constructor because the bitmap is null.
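For context, here is a simplified paraphrase of what the BitmapReader constructor does (reconstructed from memory, field names abridged, not the exact Arrow source): it eagerly reads the first byte of the bitmap whenever the length is non-zero, so a null bitmap pointer combined with a non-zero length is an immediate null-pointer dereference.

// Simplified paraphrase of arrow::internal::BitmapReader's constructor;
// not the exact upstream code, just the shape of the crash.
#include <cstdint>

class BitmapReaderSketch {
 public:
  BitmapReaderSketch(const uint8_t* bitmap, int64_t start_offset, int64_t length)
      : bitmap_(bitmap),
        byte_offset_(start_offset / 8),
        length_(length) {
    if (length > 0) {
      // bitmap == nullptr && length > 0  =>  null-pointer dereference,
      // which is the segfault described above.
      current_byte_ = bitmap_[byte_offset_];
    }
  }

 private:
  const uint8_t* bitmap_;
  int64_t byte_offset_;
  int64_t length_;
  uint8_t current_byte_ = 0;
};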
So a simple check of the null count before constructing the BitmapReader avoids the segfault.
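A minimal sketch of that guard, assuming the Arrow C++ API of the time (Array::null_count(), Array::null_bitmap_data(), Array::offset(), Array::length(), arrow::internal::BitmapReader); the helper name below is hypothetical and this is not the actual diff in PR 1896:

// Hypothetical helper illustrating the pattern: only construct a
// BitmapReader when the array actually has nulls, so a nullptr validity
// bitmap is never dereferenced.
#include <cstdint>
#include <arrow/array.h>
#include <arrow/util/bit-util.h>  // location of BitmapReader at the time

int64_t CountValidValues(const arrow::Array& values) {
  if (values.null_count() == 0) {
    // No nulls: the validity bitmap may be a nullptr placeholder,
    // so do not build a BitmapReader over it.
    return values.length();
  }
  arrow::internal::BitmapReader valid_bits(
      values.null_bitmap_data(), values.offset(), values.length());
  int64_t valid = 0;
  for (int64_t i = 0; i < values.length(); ++i) {
    if (valid_bits.IsSet()) {
      ++valid;
    }
    valid_bits.Next();
  }
  return valid;
}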
I have already filed PR 1896.