Apache Arrow / ARROW-10027

[Python] Incorrect null column returned when using a dataset filter expression.

    Description

      When using dataset filter expressions (which I <3) with Parquet files, a column of null type is returned in full, with every original row, instead of being reduced to the rows that match the filter like the other columns.

      Here's an example.

      In [7]: import pyarrow as pa
      In [8]: import pyarrow.dataset as ds
      In [9]: import pyarrow.parquet as pq
      
      In [10]: table = pa.Table.from_arrays(
       ...:     arrays=[
       ...:         pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
       ...:         pa.array(["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]),
       ...:         pa.array([None, None, None, None, None, None, None, None, None, None]),
       ...:     ],
       ...:     names=["id", "name", "other"],
       ...: )
      
      In [11]: table
      Out[11]:
      pyarrow.Table
      id: int64
      name: string
      other: null
      
      In [12]: table.to_pandas()
      Out[12]:
         id   name other
      0   0   zero  None
      1   1    one  None
      2   2    two  None
      3   3  three  None
      4   4   four  None
      5   5   five  None
      6   6    six  None
      7   7  seven  None
      8   8  eight  None
      9   9   nine  None
      
      In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
      In [14]: data = ds.dataset("/tmp/test.parquet")
      In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
      In [16]: table
      Out[16]:
      pyarrow.Table
      id: int64
      name: string
      other: null
      
      In [17]: table.to_pydict()
      Out[17]:
      {'id': [1, 4, 7],
       'name': ['one', 'four', 'seven'],
       'other': [None, None, None, None, None, None, None, None, None, None]}
      

      The to_pydict method highlights the strange behavior: the id and name columns have 3 elements, but the other column has all 10. When I call to_pandas on the filtered table, the program crashes.
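
The mismatch can be detected without calling to_pandas. The following is a minimal sketch that works only on the plain dictionary returned by to_pydict; the helper names column_lengths and mismatched_columns are hypothetical and not part of pyarrow.

```python
def column_lengths(columns):
    """Map each column name to the number of values it holds."""
    return {name: len(values) for name, values in columns.items()}


def mismatched_columns(columns):
    """Return the names of columns whose length differs from the first column's."""
    lengths = column_lengths(columns)
    if not lengths:
        return []
    expected = next(iter(lengths.values()))
    return [name for name, n in lengths.items() if n != expected]


# The dictionary shown in Out[17], reproduced here as a literal.
filtered = {
    "id": [1, 4, 7],
    "name": ["one", "four", "seven"],
    "other": [None] * 10,
}

print(column_lengths(filtered))      # {'id': 3, 'name': 3, 'other': 10}
print(mismatched_columns(filtered))  # ['other']
```

In the session above, the same check would be applied as mismatched_columns(table.to_pydict()); an empty list would mean every column was filtered consistently.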

      This could be a C++ issue, but, since my examples are in Python, I categorized it as a Python issue. Let me know if that's wrong and I'll note that for the future.

            People

              jorisvandenbossche Joris Van den Bossche
              tazimmerman Troy Zimmerman
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m