Apache Arrow / ARROW-13340

[C++][Dataset] Simplify ScanOptions after complexity has moved to ScanNode

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++

    Description

      ScanOptions currently has a number of constraints between members, which violates the contract of a public struct:

      • filter must be bound to dataset_schema
      • projection must be bound to dataset_schema
      • projected_schema must be schema<...fields>, where the type of projection is struct<...fields>
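
The coupling described in the list above can be illustrated with a toy model. This is only a sketch: `Field`, `Schema`, `ToyScanOptions`, `FindViolation`, and `RunConstraintExample` are hypothetical stand-ins invented for this example, not Arrow types or functions, and expressions are reduced to the list of field names they reference. It shows how assigning a single member of a public struct can silently leave the other members inconsistent.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT Arrow's types.
struct Field {
  std::string name;
  std::string type;
  bool operator==(const Field& other) const {
    return name == other.name && type == other.type;
  }
};
using Schema = std::vector<Field>;

// Toy model of the coupled members; expressions are reduced to field names.
struct ToyScanOptions {
  Schema dataset_schema;
  std::vector<std::string> filter_refs;      // fields referenced by filter
  std::vector<std::string> projection_refs;  // fields referenced by projection
  std::vector<Field> projection_type;        // struct<...fields> of projection
  Schema projected_schema;                   // must equal schema<...fields>
};

bool HasField(const Schema& schema, const std::string& name) {
  for (const Field& field : schema) {
    if (field.name == name) return true;
  }
  return false;
}

// Returns a description of the first violated constraint, or nullopt if the
// members are mutually consistent.
std::optional<std::string> FindViolation(const ToyScanOptions& opts) {
  for (const std::string& ref : opts.filter_refs) {
    if (!HasField(opts.dataset_schema, ref)) {
      return "filter references '" + ref + "', absent from dataset_schema";
    }
  }
  for (const std::string& ref : opts.projection_refs) {
    if (!HasField(opts.dataset_schema, ref)) {
      return "projection references '" + ref + "', absent from dataset_schema";
    }
  }
  if (!(opts.projected_schema == opts.projection_type)) {
    return std::string("projected_schema does not match projection's type");
  }
  return std::nullopt;
}

// Assigning one member in isolation breaks an invariant the struct cannot enforce.
void RunConstraintExample() {
  ToyScanOptions opts;
  opts.dataset_schema = Schema{Field{"a", "int32"}, Field{"b", "utf8"}};
  opts.filter_refs = {"a"};
  opts.projection_refs = {"b"};
  opts.projection_type = Schema{Field{"b", "utf8"}};
  opts.projected_schema = Schema{Field{"b", "utf8"}};
  assert(!FindViolation(opts).has_value());

  opts.projected_schema.clear();  // a caller edits just one member
  assert(FindViolation(opts).has_value());
}
```

Because every member is publicly assignable, nothing prevents the second state in `RunConstraintExample`; the consistency check has to be repeated wherever the options are consumed.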

      These are currently required to support FilterAndProjectScanTask, but after ARROW-13328 this complexity can be removed and ScanOptions can be a pure struct argument to MakeScanNode. Specifically, it should be possible to:

      • remove the projected_schema field (ScanNode doesn't need to know the schemas of any subsequent nodes)
      • remove the projection field (ScanNode doesn't need to know how or if scanned batches will be projected)
      • provide a simple vector of FieldRef to indicate which fields should be materialized (MakeScanNode can validate that this includes every field referenced by filter)
      • allow filter to be unbound (MakeScanNode can bind it to the dataset schema)
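
A rough sketch of what the simplified shape proposed above might look like follows. Everything here is an assumption made for illustration: `SimpleScanOptions`, `BoundScan`, `IndexOf`, `BindAndValidate`, and `RunSimplifiedExample` are hypothetical names, plain strings stand in for `FieldRef`, and the unbound filter is reduced to the field names it references. `BindAndValidate` plays the role that the issue assigns to MakeScanNode: it resolves names against the dataset's fields and checks that every field the filter references is materialized.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical simplified options: members carry no cross-member invariants.
struct SimpleScanOptions {
  std::vector<std::string> materialized_fields;  // stand-in for vector<FieldRef>
  std::vector<std::string> filter_refs;          // names used by an unbound filter
};

// Result of binding: names resolved to positions in the dataset's field list.
struct BoundScan {
  std::vector<std::size_t> materialized_indices;
  std::vector<std::size_t> filter_indices;
};

std::optional<std::size_t> IndexOf(const std::vector<std::string>& names,
                                   const std::string& name) {
  for (std::size_t i = 0; i < names.size(); ++i) {
    if (names[i] == name) return i;
  }
  return std::nullopt;
}

// Stand-in for the validation and binding a scan node factory could perform.
// On failure, returns nullopt and writes a message to *error.
std::optional<BoundScan> BindAndValidate(
    const std::vector<std::string>& dataset_fields,
    const SimpleScanOptions& opts, std::string* error) {
  BoundScan bound;
  for (const std::string& name : opts.materialized_fields) {
    std::optional<std::size_t> index = IndexOf(dataset_fields, name);
    if (!index) {
      *error = "materialized field '" + name + "' is not in the dataset";
      return std::nullopt;
    }
    bound.materialized_indices.push_back(*index);
  }
  for (const std::string& name : opts.filter_refs) {
    if (!IndexOf(opts.materialized_fields, name)) {
      *error = "filter references '" + name + "', which is not materialized";
      return std::nullopt;
    }
    // Safe to dereference: every materialized field was found in the dataset.
    bound.filter_indices.push_back(*IndexOf(dataset_fields, name));
  }
  return bound;
}

void RunSimplifiedExample() {
  const std::vector<std::string> dataset_fields = {"a", "b", "c"};
  std::string error;

  SimpleScanOptions ok;
  ok.materialized_fields = {"a", "c"};
  ok.filter_refs = {"c"};
  assert(BindAndValidate(dataset_fields, ok, &error).has_value());

  SimpleScanOptions bad = ok;
  bad.filter_refs = {"b"};  // filter needs a field that is not materialized
  assert(!BindAndValidate(dataset_fields, bad, &error).has_value());
}
```

In this sketch the options struct can be filled in any order, and all consistency checks happen in one place at node construction time.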

      dataset_schema also seems slightly redundant, since MakeScanNode already takes a Dataset as an argument. However, it is currently used by CsvFileFormat to derive column types.

            People

              Assignee: Unassigned
              Reporter: Ben Kietzman (bkietz)
              Votes: 0
              Watchers: 5
