Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
To keep the Dataset API compatible with the Table API in terms of analytics capabilities, we should add a Dataset.filter method. The initial POC was based on _table_filter, but that required materialising all of the Dataset content after filtering, because it returned an InMemoryDataset.
Given that a Scanner can filter a dataset without materialising the data until a final step happens, it would be good to have Dataset.filter return some form of lazy dataset, where the filter is only stored and the Scanner is created when data is actually retrieved.
PS: Also update the test_dataset_filter test to use the Dataset.filter method.
Issue Links
- is related to: ARROW-16409 [C++][Python][R] Deprecate "scanner" (but keep "scan node") from public API (Open)