  1. Apache Arrow
  2. ARROW-8065

[C++][Dataset] Untangle Dataset, Fragment and ScanOptions


Details

    Description

      Currently, a fragment is a product of a scan: it is a lazy collection of scan tasks corresponding to a logically singular data source (for example a single file or a single row group). It would be more useful if a fragment were instead the direct object of a scan, so that one scans a fragment (or a collection of fragments):

      1. Remove ScanOptions from Fragment's properties and move it into Fragment::Scan parameters.
      2. Remove ScanOptions from Dataset::GetFragments. To support predicate pushdown in FileSystemDataset and UnionDataset, we can provide an overload, Dataset::GetFragments(std::shared_ptr<Expression> predicate).
      3. Expose a lazy accessor, Fragment::physical_schema().
      4. Consolidate ScanOptions and ScanContext.

      This will lessen the cognitive dissonance between fragments and files since fragments will no longer include references to scan properties.
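
As a rough illustration of the shape proposed above, here is a minimal, self-contained sketch. It does not use the real Arrow classes: every type below is a toy stand-in, and the `partition` field, the `matches_partition` callback, the `schema_reads()` counter and the hard-coded schema are assumptions made only so the example compiles on its own (C++17).

```cpp
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins; none of these are the real Arrow types.
struct Schema {
  std::vector<std::string> field_names;
};

// Hypothetical stand-in for Expression: a predicate on a partition value.
struct Expression {
  std::function<bool(int)> matches_partition;
};

// Items 1 and 4: everything describing *how* to scan lives in one place
// and is passed to Scan() instead of being stored on the fragment.
struct ScanOptions {
  std::vector<std::string> projected_columns;
  int batch_size = 1024;
};

struct ScanTask {
  std::string source;
  std::vector<std::string> columns;
  int batch_size;
};

class Fragment {
 public:
  Fragment(std::string path, int partition)
      : path_(std::move(path)), partition_(partition) {}

  // Item 1: the options are a parameter of the scan, not a property.
  std::vector<ScanTask> Scan(const ScanOptions& options) const {
    return {ScanTask{path_, options.projected_columns, options.batch_size}};
  }

  // Item 3: the schema is computed on first access and cached afterwards.
  const Schema& physical_schema() {
    if (!physical_schema_) {
      ++schema_reads_;  // a real fragment would inspect the file here
      physical_schema_ = Schema{std::vector<std::string>{"a", "b"}};
    }
    return *physical_schema_;
  }

  int partition() const { return partition_; }
  int schema_reads() const { return schema_reads_; }

 private:
  std::string path_;
  int partition_;
  std::optional<Schema> physical_schema_;
  int schema_reads_ = 0;
};

class Dataset {
 public:
  explicit Dataset(std::vector<std::shared_ptr<Fragment>> fragments)
      : fragments_(std::move(fragments)) {}

  // Item 2: listing fragments no longer needs ScanOptions.
  std::vector<std::shared_ptr<Fragment>> GetFragments() const {
    return fragments_;
  }

  // Item 2: overload that prunes fragments with a predicate.
  std::vector<std::shared_ptr<Fragment>> GetFragments(
      const std::shared_ptr<Expression>& predicate) const {
    std::vector<std::shared_ptr<Fragment>> selected;
    for (const auto& fragment : fragments_) {
      if (predicate->matches_partition(fragment->partition())) {
        selected.push_back(fragment);
      }
    }
    return selected;
  }

 private:
  std::vector<std::shared_ptr<Fragment>> fragments_;
};
```

With this shape, the same fragment can be scanned several times with different options (for example a different projection or batch size), and a fragment can be listed or inspected without any scan configuration existing at all.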

            People

              fsaintjacques Francois Saint-Jacques