Apache Arrow / ARROW-16045

Version=7.0.0 introduces bug when filtering by empty set during load


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.0.0
    • Fix Version/s: 6.0.1
    • Component/s: Python
    • Labels: None
    • Environment: pandas 1.3.5, pyarrow 7.0.0, python 3.10.4

    Description

      This issue is present in pyarrow v7.0.0, but not in v6.0.1.

      pyarrow raises an error when reading a Parquet file with an empty-set "in" filter on a categorical or string column (columns "E" and "F", respectively). The issue does not occur in v7.0.0 when the same filter is applied to a float, timestamp, or integer column ("A" through "D").

       

      The following Python code presents a minimal example which reproduces the issue:

      import pandas as pd
      import numpy as np
      path = './example_df.parquet'
      df = pd.DataFrame(
          {
              "A": 1.0,
              "B": pd.Timestamp("20130102"),
              "C": pd.Series(1, index=list(range(4)), dtype="float32"),
              "D": np.array([3] * 4, dtype="int32"),
              "E": pd.Categorical(["test", "train", "test", "train"]),
              "F": "foo",
          }
      )
      df.to_parquet(path)
      
      # Works: empty-set filter on float column "A"
      df_read = pd.read_parquet(
          path,
          filters=[
              [
                  ("A", "in", set())
              ]
          ]
      )
      
      # Pyarrow v6.0.1 and v7.0.0
      #
      # Empty DataFrame
      # Columns: [A, B, C, D, E, F]
      # Index: []
      print(df_read)
      
      # Fails in v7.0.0: empty-set filter on string column "F"
      df_read = pd.read_parquet(
          path,
          filters=[
              [
                  ("F", "in", set())
              ]
          ]
      )
      
      # Pyarrow v6.0.1
      #
      # Empty DataFrame
      # Columns: [A, B, C, D, E, F]
      # Index: []
      
      # Pyarrow v7.0.0
      #
      # pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
      print(df_read) 
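
      Until the regression is resolved, one possible caller-side workaround is to detect empty-set "in" predicates before they reach pyarrow. The sketch below is not part of pyarrow or pandas; the helper names `has_empty_in` and `split_filters` are hypothetical, and it assumes filters are given in the list-of-lists (OR of AND-groups) form used in the example above. An AND-group containing an "in" predicate with no values can never match a row, so it can be dropped; if every group is dropped, the result is known to be empty without applying any filter.

```python
from typing import Any, List, Tuple

# One predicate is (column, operator, value), as in the reproduction above.
Predicate = Tuple[str, str, Any]


def has_empty_in(conjunction: List[Predicate]) -> bool:
    """Return True if an AND-group contains an 'in' predicate with no values."""
    return any(op == "in" and len(values) == 0 for _col, op, values in conjunction)


def split_filters(filters: List[List[Predicate]]) -> Tuple[List[List[Predicate]], bool]:
    """Drop AND-groups that can never match (hypothetical helper).

    Returns (kept_groups, always_empty). always_empty is True only when
    filters were supplied and every group was unsatisfiable.
    """
    kept = [group for group in filters if not has_empty_in(group)]
    always_empty = bool(filters) and not kept
    return kept, always_empty


# The failing filter from the report: nothing is kept, result is known empty.
kept, always_empty = split_filters([[("F", "in", set())]])
```

      When `always_empty` is True the caller can skip the filtered read entirely; otherwise `kept` can be passed as the `filters` argument in place of the original list.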

          People

            Assignee: Unassigned
            Reporter: Damian Barabonkov (damianb)