Apache Arrow / ARROW-2656

[Python] Improve ParquetManifest creation time

    Details

      Description

      When a parquet dataset is highly partitioned, constructing a ParquetManifest takes a significant amount of time because it serially visits directories to find all parquet files. For a dataset with thousands of partition values, this can take several minutes on a personal laptop.

      A quick win to vastly improve this performance would be to use a ThreadPool so that calls to _visit_level happen concurrently, instead of wasting most of the time waiting on I/O.
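
      A minimal sketch of the concurrent approach, using only the standard library. It does not reproduce the actual ParquetManifest or _visit_level code: the helper names `_list_dir` and `find_parquet_files`, the `.parquet` suffix check, and the default worker count are all assumptions made for illustration. Each level of the partition tree is listed in parallel, so the I/O waits overlap rather than add up.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def _list_dir(path):
    """List one directory, returning (subdirectories, parquet files).

    Hypothetical stand-in for the per-directory work done by _visit_level.
    """
    subdirs, files = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir():
                subdirs.append(entry.path)
            elif entry.name.endswith(".parquet"):  # assumed naming convention
                files.append(entry.path)
    return subdirs, files


def find_parquet_files(root, max_workers=16):
    """Walk the partition tree level by level, listing each level concurrently."""
    found = []
    level = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while level:
            next_level = []
            # All directories of the current level are listed in parallel.
            for subdirs, files in pool.map(_list_dir, level):
                next_level.extend(subdirs)
                found.extend(files)
            level = next_level
    return sorted(found)
```

      Walking level by level keeps every task independent, so no worker ever blocks waiting on a task it submitted itself. The speedup depends on how much of the listing time is I/O latency, which is typically highest on remote or networked filesystems.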

      An even faster option would be to allow optional indexing of dataset metadata in something like the common_metadata. The index could list all files in the manifest along with their row_group information. This would also allow split_row_groups to be implemented efficiently, without opening every parquet file in the dataset to retrieve its metadata, which is quite time consuming for large datasets. The main problem with the indexing approach is that it requires the dataset to be immutable, which doesn't seem too unreasonable. This specific implementation seems related to https://issues.apache.org/jira/browse/ARROW-1983; however, that issue only covers the write portion.
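
      To make the indexing idea concrete, here is a rough sketch of what such an index could hold and how a reader could use it. The JSON layout, the `version` field, and the functions `build_index` and `pieces_from_index` are hypothetical and not part of pyarrow. Where the index would be stored (for example alongside the common_metadata) is left open.

```python
import json


def build_index(row_group_counts):
    """Serialize {relative file path: number of row groups} as a JSON index.

    Hypothetical format; it would be written once, when the dataset is written.
    """
    entries = [
        {"path": path, "num_row_groups": count}
        for path, count in sorted(row_group_counts.items())
    ]
    return json.dumps({"version": 1, "files": entries})


def pieces_from_index(index_json, split_row_groups=False):
    """Enumerate read units from the index alone, without opening any data file."""
    index = json.loads(index_json)
    pieces = []
    for entry in index["files"]:
        if split_row_groups:
            # One piece per (file, row group) pair.
            pieces.extend(
                (entry["path"], i) for i in range(entry["num_row_groups"])
            )
        else:
            # One piece per file; None means "all row groups".
            pieces.append((entry["path"], None))
    return pieces
```

      Because the reader trusts the index instead of listing directories, adding or removing a file after the index is written would leave it stale. That is why this approach depends on the dataset being immutable.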

              People

              • Assignee: Robbie Gruener (rgruener)
              • Reporter: Robbie Gruener (rgruener)
              • Votes: 0
              • Watchers: 2

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h 20m