  1. Apache Arrow
  2. ARROW-17212 [Python] Support lazy Dataset.filter
  3. ARROW-16616

[Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter method


Details

    Description

      To keep the Dataset API compatible with the Table one in terms of analytics capabilities, we should add a Dataset.filter method. The initial POC was based on _table_filter, but that required materialising all of the Dataset content after filtering, because it returned an InMemoryDataset.

      Given that Scanner can filter a dataset without actually materialising the data until a final step happens, it would be good to have Dataset.filter return some form of lazy dataset, where the filter is only stored aside and the Scanner is created when the data is actually retrieved.
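
      The idea of storing the filter and deferring the scan can be sketched in plain Python. This is only an illustration of the pattern, not pyarrow's implementation: `LazyDataset`, `scan`, `to_rows` and `source` are hypothetical names, rows are plain dicts, and filters are ordinary callables rather than Arrow expressions.

```python
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, int]
Predicate = Callable[[Row], bool]
BatchSource = Callable[[], Iterator[List[Row]]]


class LazyDataset:
    """Hypothetical stand-in for a Dataset that filters lazily."""

    def __init__(self, source: BatchSource, predicate: Optional[Predicate] = None):
        self._source = source        # callable that yields batches when invoked
        self._predicate = predicate  # filter stored aside, not yet applied

    def filter(self, predicate: Predicate) -> "LazyDataset":
        """Return a new dataset carrying the filter; no data is read here."""
        if self._predicate is None:
            combined = predicate
        else:
            previous = self._predicate
            combined = lambda row: previous(row) and predicate(row)
        return LazyDataset(self._source, combined)

    def scan(self) -> Iterator[Row]:
        """Plays the Scanner role: reads batches and applies the stored filter."""
        for batch in self._source():
            for row in batch:
                if self._predicate is None or self._predicate(row):
                    yield row

    def to_rows(self) -> List[Row]:
        """Final step: only here is the data actually materialised."""
        return list(self.scan())


reads: List[int] = []  # records which batches were actually read


def source() -> Iterator[List[Row]]:
    batches = [[{"a": 1}, {"a": 2}], [{"a": 3}, {"a": 4}]]
    for index, batch in enumerate(batches):
        reads.append(index)
        yield batch


dataset = LazyDataset(source)
filtered = dataset.filter(lambda r: r["a"] > 1).filter(lambda r: r["a"] < 4)
reads_before_materialising = list(reads)  # still empty: nothing scanned yet
result = filtered.to_rows()               # the scan happens only now
```

      Chained `filter` calls only combine predicates; the batches are touched once, when `to_rows` is called. An InMemoryDataset-based approach would instead read and copy the data at each `filter` call.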

      PS: Also update test_dataset_filter test to use the Dataset.filter method

            People

              amol- Alessandro Molina
              Votes: 0
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 6h 40m