  1. Apache Arrow
  2. ARROW-8062

[C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file

Details

    Description

      Partitioned parquet datasets sometimes come with _metadata / _common_metadata files. Those files include information about the schema of the full dataset and potentially all RowGroup metadata as well (for _metadata).

      Using those files when creating a parquet Dataset can make the factory more efficient: it can use the stored schema instead of inferring one by unioning the schemas of all files, and it can use the recorded paths to the individual parquet files instead of crawling the directory.

      Basically, based on those files, the schema, the list of paths and the partition expressions (the information that is needed to create a Dataset) could be constructed.
      Such logic could be put in a separate factory class, e.g. ParquetManifestFactory (as suggested by fsaintjacques).
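
      As a rough illustration of the idea (not the proposed C++ implementation), the sketch below derives a list of unique file paths and their partition keys from per-row-group file paths, which is the kind of information a _metadata file can record. The function names, the example paths and the assumption of hive-style `key=value` directories are all hypothetical and chosen only for illustration.

```python
def partition_keys(path):
    """Parse hive-style `key=value` directory segments from a relative path."""
    keys = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            keys[key] = value
    return keys


def manifest_from_row_group_paths(row_group_paths):
    """Collapse per-row-group paths into unique files with their partition keys.

    No directory crawling is needed: the paths alone are enough.
    """
    manifest = {}
    for path in row_group_paths:
        # A file with several row groups appears several times; keep it once.
        manifest.setdefault(path, partition_keys(path))
    return manifest


# Hypothetical paths, one entry per row group (assumed layout, not real data).
row_group_paths = [
    "year=2019/month=1/part-0.parquet",
    "year=2019/month=1/part-0.parquet",
    "year=2020/month=2/part-1.parquet",
]

manifest = manifest_from_row_group_paths(row_group_paths)
for path, keys in manifest.items():
    print(path, keys)
```

      A real factory would additionally read the schema from the metadata file and turn the parsed keys into typed partition expressions; this sketch keeps the values as strings.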

            People

              fsaintjacques Francois Saint-Jacques
              jorisvandenbossche Joris Van den Bossche
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h