Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Fix Version: 1.0.1
Description
When using dataset filter expressions (which I <3) with Parquet files, null-typed columns are returned in their entirety rather than being filtered down to the rows that matched the filter on the other columns.
Here's an example.
In [7]: import pyarrow as pa

In [8]: import pyarrow.dataset as ds

In [9]: import pyarrow.parquet as pq

In [10]: table = pa.Table.from_arrays(
    ...:     arrays=[
    ...:         pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
    ...:         pa.array(["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]),
    ...:         pa.array([None, None, None, None, None, None, None, None, None, None]),
    ...:     ],
    ...:     names=["id", "name", "other"],
    ...: )

In [11]: table
Out[11]:
pyarrow.Table
id: int64
name: string
other: null

In [12]: table.to_pandas()
Out[12]:
   id   name other
0   0   zero  None
1   1    one  None
2   2    two  None
3   3  three  None
4   4   four  None
5   5   five  None
6   6    six  None
7   7  seven  None
8   8  eight  None
9   9   nine  None

In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")

In [14]: data = ds.dataset("/tmp/test.parquet")

In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))

In [16]: table
Out[16]:
pyarrow.Table
id: int64
name: string
other: null

In [17]: table.to_pydict()
Out[17]:
{'id': [1, 4, 7],
 'name': ['one', 'four', 'seven'],
 'other': [None, None, None, None, None, None, None, None, None, None]}
The to_pydict method highlights the strange behavior: the id and name columns have 3 elements, but the other column has all 10. When I call to_pandas on the filtered table, the program crashes.
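To make the mismatch easy to check directly, here is a minimal sketch that reuses the Parquet file written above and compares the lengths of the filtered columns. The values in the comments come from the to_pydict output above, so treat them as illustrative of the buggy behavior rather than guaranteed on every pyarrow version.

import pyarrow.dataset as ds

# Reuse the Parquet file written in the reproduction above.
data = ds.dataset("/tmp/test.parquet")
filtered = data.to_table(filter=ds.field("id").isin([1, 4, 7]))

# The id and name columns reflect the filter; the null-typed column does not.
print(len(filtered.column("id")))     # 3
print(len(filtered.column("name")))   # 3
print(len(filtered.column("other")))  # 10, even though only 3 rows matched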
This could be a C++ issue, but since my examples are in Python, I categorized it as a Python issue. Let me know if that's wrong and I'll note it for the future.