Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
ScanOptions currently has a number of constraints between members, which violates the contract of a public struct:
- filter must be bound to dataset_schema
- projection must be bound to dataset_schema
- projected_schema must be schema<...fields>, where the type of projection is struct<...fields>
These constraints are currently required to support FilterAndProjectScanTask. After ARROW-13328, this complexity can be removed and ScanOptions can be a pure struct argument to MakeScanNode. Specifically, it should be possible to:
- remove the projected_schema field (ScanNode doesn't need to know the schemas of any subsequent nodes)
- remove the projection field (ScanNode doesn't need to know how or if scanned batches will be projected)
- provide a simple vector of FieldRef to indicate which fields should be materialized (MakeScanNode can validate that this includes every field referenced by filter)
- allow filter to be unbound (MakeScanNode can bind it to the dataset schema)
dataset_schema also seems slightly redundant, since MakeScanNode already takes a Dataset as an argument, but it is currently used by CsvFileFormat to derive column types.
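To make the proposal concrete, here is a rough sketch of what a slimmed-down ScanOptions and the validation done by MakeScanNode might look like. It deliberately does not use the real Arrow headers: everything in the `sketch` namespace is a hypothetical stand-in (FieldRef is reduced to a column name, the filter only records which fields it references, and `ValidateScanOptions` and `materialized_fields` are invented names), so treat it as an illustration of the intended checks rather than the actual API. It assumes C++17 for `std::optional`.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

namespace sketch {

// Stand-in for arrow::FieldRef: a field is referenced by name only.
using FieldRef = std::string;

// Stand-in for an unbound filter expression. Only the set of fields
// it references matters for this sketch.
struct Filter {
  std::vector<FieldRef> referenced_fields;
};

// Hypothetical slimmed-down options: no projection, no projected_schema,
// and no requirement that the filter is already bound.
struct ScanOptions {
  Filter filter;
  std::vector<FieldRef> materialized_fields;
};

// Checks a MakeScanNode-like factory could perform before binding the filter.
// Returns an error message on failure and std::nullopt on success.
std::optional<std::string> ValidateScanOptions(
    const ScanOptions& options, const std::vector<FieldRef>& dataset_fields) {
  auto contains = [](const std::vector<FieldRef>& haystack,
                     const FieldRef& needle) {
    return std::find(haystack.begin(), haystack.end(), needle) !=
           haystack.end();
  };

  // Every materialized field must exist in the dataset schema.
  for (const auto& ref : options.materialized_fields) {
    if (!contains(dataset_fields, ref)) {
      return "materialized field not in dataset schema: " + ref;
    }
  }
  // Every field referenced by the filter must be materialized.
  for (const auto& ref : options.filter.referenced_fields) {
    if (!contains(options.materialized_fields, ref)) {
      return "filter references non-materialized field: " + ref;
    }
  }
  return std::nullopt;
}

}  // namespace sketch
```

With this shape, a caller fills in the struct without having to bind anything up front, and inconsistent options are rejected in one place instead of relying on the caller to keep the members in sync.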
Issue Links
- depends upon: ARROW-13328 [C++][Dataset] Use an ExecPlan for synchronous scans or drop synchronous scans (Closed)