[ARROW-10027] [Python] Incorrect null column returned when using a dataset filter expression. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.1
Fix Version/s: 2.0.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/17348

Description

When using dataset filter expressions (which I <3) with Parquet files, a column of null type is returned in its entirety, rather than only the rows that matched the filter on the other columns.

Here's an example.

In [7]: import pyarrow as pa
In [8]: import pyarrow.dataset as ds
In [9]: import pyarrow.parquet as pq

In [10]: table = pa.Table.from_arrays(
 ...:     arrays=[
 ...:         pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 ...:         pa.array(["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]),
 ...:         pa.array([None, None, None, None, None, None, None, None, None, None]),
 ...:     ],
 ...:     names=["id", "name", "other"],
 ...: )

In [11]: table
Out[11]:
pyarrow.Table
id: int64
name: string
other: null

In [12]: table.to_pandas()
Out[12]:
   id   name other
0   0   zero  None
1   1    one  None
2   2    two  None
3   3  three  None
4   4   four  None
5   5   five  None
6   6    six  None
7   7  seven  None
8   8  eight  None
9   9   nine  None

In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
In [14]: data = ds.dataset("/tmp/test.parquet")
In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
In [16]: table
Out[16]:
pyarrow.Table
id: int64
name: string
other: null

In [17]: table.to_pydict()
Out[17]:
{'id': [1, 4, 7],
 'name': ['one', 'four', 'seven'],
 'other': [None, None, None, None, None, None, None, None, None, None]}

The to_pydict method highlights the strange behavior: the id and name columns have 3 elements, but the other column has all 10. When I call to_pandas on the filtered table, the program crashes.
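The mismatch described above can be detected without calling to_pandas. The following is only a minimal sketch: the helper name mismatched_columns is hypothetical (it is not part of pyarrow), and the sample dictionary is copied from the Out[17] output above rather than read from a Parquet file.

```python
def mismatched_columns(columns, reference):
    """Return {name: length} for every column whose length differs
    from the length of the `reference` column.

    `columns` is a plain dict of lists, e.g. the result of
    `table.to_pydict()`. This helper is hypothetical, not a pyarrow API.
    """
    expected = len(columns[reference])
    return {
        name: len(values)
        for name, values in columns.items()
        if len(values) != expected
    }


# Data copied from the Out[17] output in this report.
filtered = {
    "id": [1, 4, 7],
    "name": ["one", "four", "seven"],
    "other": [None] * 10,
}

print(mismatched_columns(filtered, reference="id"))  # {'other': 10}
```

With a real table, the same check could be run as mismatched_columns(table.to_pydict(), reference="id"); an empty result means every column has the same length as the filtered id column.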

This could be a C++ issue, but, since my examples are in Python, I categorized it as a Python issue. Let me know if that's wrong and I'll note that for the future.

Issue Links

links to

GitHub Pull Request #8209

People

Assignee: Joris Van den Bossche

Reporter: Troy Zimmerman

Votes: 0

Watchers: 4

Dates

Created: 16/Sep/20 21:00

Updated: 11/Jan/23 08:10

Resolved: 24/Sep/20 09:55

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

40m