Apache Arrow / ARROW-8658

[C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments


Details

    Description

      Subtree pruning is a valuable optimization for large datasets with multiple partition fields: once a partition is known not to satisfy a filter, everything beneath it can be skipped. For example, given a hive-style directory $base_dir/a=3/ and a filter "a"_ == 2, none of its files or subdirectories need to be examined.

      After ARROW-8318, FileSystemDataset stores only files, so subtree pruning (whose implementation depended on directories being present to represent subtrees) was disabled. It should be possible to reintroduce it without reference to directories by examining partition expressions directly and extracting a tree structure from their subexpressions.
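A minimal sketch of the idea, not the Arrow implementation: it assumes every partition expression is an ordered conjunction of equality conjuncts (as hive-style partitioning produces) and that the filter is a single equality. The names `Conjunct`, `Fragment`, `Node`, `BuildTree`, `PruneResult`, and `SelectFragments` are hypothetical and do not correspond to Arrow's API. The tree is derived from the partition expressions alone, so no directory entries are needed, and a whole subtree is skipped as soon as one of its conjuncts contradicts the filter.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one conjunct of a partition expression, e.g. a == 3.
using Conjunct = std::pair<std::string, int>;  // (field name, required value)

// Hypothetical stand-in for a file fragment: a path plus its partition
// expression, written as an ordered conjunction of equality conjuncts.
struct Fragment {
  std::string path;
  std::vector<Conjunct> partition;
};

// A node groups all fragments whose partition expressions share a prefix.
struct Node {
  std::map<Conjunct, std::unique_ptr<Node>> children;
  std::vector<std::size_t> fragments;  // indices of fragments that end here
};

// Build the tree from the partition expressions only; no directories involved.
Node BuildTree(const std::vector<Fragment>& fragments) {
  Node root;
  for (std::size_t i = 0; i < fragments.size(); ++i) {
    Node* node = &root;
    for (const Conjunct& c : fragments[i].partition) {
      std::unique_ptr<Node>& child = node->children[c];
      if (!child) child = std::make_unique<Node>();
      node = child.get();
    }
    node->fragments.push_back(i);
  }
  return root;
}

struct PruneResult {
  std::vector<std::string> paths;  // fragments that may satisfy the filter
  int nodes_visited = 0;           // tree nodes actually examined
};

// Depth-first walk that skips any subtree whose conjunct contradicts the filter.
void Walk(const Node& node, const Conjunct& filter,
          const std::vector<Fragment>& fragments, PruneResult* out) {
  ++out->nodes_visited;
  for (std::size_t i : node.fragments) {
    assert(i < fragments.size());
    out->paths.push_back(fragments[i].path);
  }
  for (const auto& entry : node.children) {
    const Conjunct& c = entry.first;
    // Same field, different value: nothing below this node can match.
    if (c.first == filter.first && c.second != filter.second) continue;
    Walk(*entry.second, filter, fragments, out);
  }
}

PruneResult SelectFragments(const std::vector<Fragment>& fragments,
                            const Conjunct& filter) {
  Node root = BuildTree(fragments);
  PruneResult result;
  Walk(root, filter, fragments, &result);
  return result;
}
```

With four fragments partitioned on a in {2, 3} and b in {0, 1}, the filter a == 2 causes the a == 3 node to be rejected once, so neither of the two nodes beneath it is ever examined. A real implementation would also have to handle arbitrary expressions rather than only equalities, which this sketch deliberately leaves out.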


            People

              bkietz Ben Kietzman


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h 10m