Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Fix Version/s: 4.0.0
Description
Reading a dataset with a dictionary column, where some of the files contain no data for that column (so the column is typed as null in those files), has been broken since https://github.com/apache/arrow/pull/9532. This worked in the 3.0 release, so I consider it a regression.
This can be reproduced using the following Python snippet:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({"a": [None, None]})
pq.write_table(table, "test.parquet")

schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))])
fsds = ds.FileSystemDataset.from_paths(
    paths=["test.parquet"],
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
    filesystem=pa.fs.LocalFileSystem(),
)
fsds.to_table()
The exception on master is currently:
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-14-5f0bc602f16b> in <module>
      6     filesystem=pa.fs.LocalFileSystem(),
      7 )
----> 8 fsds.to_table()

~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
    456         table : Table instance
    457         """
--> 458         return self._scanner(**kwargs).to_table()
    459
    460     def head(self, int num_rows, **kwargs):

~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
   2887             result = self.scanner.ToTable()
   2888
-> 2889         return pyarrow_wrap_table(GetResultValue(result))
   2890
   2891     def take(self, object indices):

~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
    139 cdef api int pyarrow_internal_check_status(const CStatus& status) \
    140         nogil except -1:
--> 141     return check_status(status)
    142
    143

~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
    116             raise ArrowKeyError(message)
    117         elif status.IsNotImplemented():
--> 118             raise ArrowNotImplementedError(message)
    119         elif status.IsTypeError():
    120             raise ArrowTypeError(message)

ArrowNotImplementedError: Unsupported cast from null to dictionary<values=string, indices=int32, ordered=0> (no available cast function for target type)