Apache Arrow / ARROW-17483

[Python] Support for 'pa.compute.Expression' in filter argument to 'pa.read_table'

Details

    Description

      Currently, the filters argument supports List[Tuple], List[List[Tuple]], or None as its input types. I was surprised to see that Expressions are not supported, considering that filters are converted to expressions internally when using use_legacy_dataset=False.

      The check on L150-L153 short-circuits and succeeds when it encounters an expression, but the call later fails on L2343, where the expression is evaluated as part of a boolean expression.
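The failure mode described above can be sketched without pyarrow. This is a minimal illustration, not the library's code: `FakeExpression`, `validate`, and `apply_filters` are hypothetical names, and the sketch assumes only that an expression object refuses to be converted to a Python boolean.

```python
class FakeExpression:
    """Hypothetical stand-in for an expression that has no truth value."""

    def __bool__(self):
        raise ValueError("An expression cannot be evaluated to True or False")


def validate(filters):
    # Hypothetical early check: `is None` is an identity test, so it never
    # calls __bool__ and an expression passes through unnoticed.
    if filters is None:
        return None
    return filters


def apply_filters(filters):
    # Hypothetical later step: using the object as a condition calls
    # __bool__, which is where the error finally surfaces.
    if filters:
        return "filtering"
    return "no filtering"


expr = FakeExpression()
passed_check = validate(expr) is expr  # the early check does not reject it

try:
    apply_filters(expr)
    failed_late = False
except ValueError:
    failed_late = True
```

Under these assumptions the early check accepts the expression and the error appears only at the later truth-value test, which matches the behaviour reported here.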

      I think declaring filters using pa.compute.Expression is more Pythonic and less error-prone, and ill-formed filters would be detected much earlier than with the list-of-tuple-of-string equivalents.

      Example:

      import pyarrow as pa
      import pyarrow.compute as pc
      import pyarrow.parquet as pq
      
      # Creating a dummy table
      table = pa.table({
          'year': [2020, 2022, 2021, 2022, 2019, 2021],
          'n_legs': [2, 2, 4, 4, 5, 100],
          'animal': ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"]
      })
      pq.write_to_dataset(table, root_path='dataset_name_2', partition_cols=['year'])
      
      # Reading using 'pyarrow.compute.Expression'
      pq.read_table('dataset_name_2', columns=["n_legs", "animal"], filters=pc.field("n_legs") < 4)
      
      # Reading using List[Tuple]
      pq.read_table('dataset_name_2', columns=["n_legs", "animal"], filters=[('n_legs', '<', 4)])  

            People

              milesgranger Miles Granger
              patrikkj Patrik Kjærran

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 50m