  1. Apache Arrow
  2. ARROW-10131

[C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment


Details

    Description

      Related to ARROW-9730: parsing the statistics in Parquet metadata is expensive and should therefore be avoided when possible.

      For example, the ParquetDatasetFactory (ds.parquet_dataset() in Python) parses all statistics of all files and all columns. A filtered read, however, might only need the statistics of certain files (e.g. if a filter on a partition field already excluded many files) and certain columns (e.g. only the columns you are actually filtering on).

      The current API is a bit all-or-nothing: both ParquetDatasetFactory and a later EnsureCompleteMetadata call parse all statistics, and neither allows parsing only a subset, or parsing only the other (non-statistics) metadata. I think we should try to come up with better abstractions.
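
To make the idea concrete, here is a minimal pure-Python sketch of what "lazy" could mean here. It is not Arrow code and not a proposed API: `LazyFragment`, `select_fragments`, the string-encoded `raw_stats`, and the sample paths are all hypothetical names invented for illustration. The sketch only shows the access pattern: prune by partition first, then decode statistics for just the filter column, and cache the result.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LazyFragment:
    """Hypothetical fragment that decodes column statistics only on request."""

    path: str
    partition: Dict[str, int]
    # Stand-in for undecoded metadata: column -> (min, max) kept as strings.
    raw_stats: Dict[str, Tuple[str, str]]
    _parsed: Dict[str, Tuple[int, int]] = field(default_factory=dict)
    parse_count: int = 0  # how many columns were actually decoded

    def statistics(self, column: str) -> Tuple[int, int]:
        # Decode (here: int()) only the requested column, and only once.
        if column not in self._parsed:
            lo, hi = self.raw_stats[column]
            self._parsed[column] = (int(lo), int(hi))
            self.parse_count += 1
        return self._parsed[column]


def select_fragments(
    fragments: List[LazyFragment],
    partition_key: str,
    partition_value: int,
    column: str,
    value: int,
) -> List[LazyFragment]:
    """Keep fragments that may hold rows matching both equality filters."""
    selected = []
    for frag in fragments:
        if frag.partition.get(partition_key) != partition_value:
            continue  # excluded by the partition filter; no statistics decoded
        lo, hi = frag.statistics(column)
        if lo <= value <= hi:
            selected.append(frag)
    return selected


# Hypothetical dataset: 3 files x 2 columns = 6 statistics entries in total.
fragments = [
    LazyFragment("year=2019/a.parquet", {"year": 2019},
                 {"id": ("0", "99"), "price": ("1", "50")}),
    LazyFragment("year=2020/a.parquet", {"year": 2020},
                 {"id": ("100", "199"), "price": ("2", "60")}),
    LazyFragment("year=2020/b.parquet", {"year": 2020},
                 {"id": ("200", "299"), "price": ("3", "70")}),
]

matches = select_fragments(fragments, "year", 2020, "id", 150)
total_parsed = sum(f.parse_count for f in fragments)
```

With the filter `year == 2020 and id == 150`, only two of the six statistics entries are decoded: the 2019 file is dropped by the partition filter before any statistics are touched, and the `price` column is never decoded at all. An eager factory would have decoded all six up front.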

      cc rjzamora bkietz

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 50m