Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- None
Description
Splitting a ParquetFileFragment into one fragment per row group (SplitByRowGroup) calls EnsureCompleteMetadata up front, but the resulting fragments do not have the has_complete_metadata_ property set. As a result, calling EnsureCompleteMetadata on the split fragments reads and parses the metadata again instead of using the cached metadata, which is already present.
Small example to illustrate:
In [1]: import pyarrow.dataset as ds

In [2]: dataset = ds.parquet_dataset("nyc-taxi-data/dask-partitioned/_metadata", partitioning="hive")

In [3]: rg_fragments = [rg for frag in dataset.get_fragments() for rg in frag.split_by_row_group()]

In [4]: len(rg_fragments)
Out[4]: 4520

# row group fragments actually have statistics
In [7]: rg_fragments[0].row_groups[0].statistics
Out[7]: {'vendor_id': {'min': '1', 'max': '4'}, 'pickup_at': {'min': datetime.datetime(2009, 1, 1, 0, 5, 51), 'max': datetime.datetime(2018, 12, 26, 14, 48, 54)}, ...

# but calling ensure_complete_metadata still takes a lot of time the first call
In [8]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
CPU times: user 1.72 s, sys: 203 ms, total: 1.92 s
Wall time: 1.9 s

In [9]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
CPU times: user 1.34 ms, sys: 0 ns, total: 1.34 ms
Wall time: 1.35 ms
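The caching pattern behind this report can be sketched with a toy model. This is not Arrow's implementation: ToyFragment, parse_count, propagate_flag, and count_reparses are hypothetical names invented for illustration, and the "metadata" is a placeholder dict. The sketch only shows why children that receive the parent's cached metadata but not the "complete" flag end up parsing again, and how carrying the flag over (one plausible remedy) avoids that.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyFragment:
    """Hypothetical stand-in for a file fragment; not pyarrow's class."""
    path: str
    row_groups: List[int]
    metadata: Optional[dict] = None
    has_complete_metadata: bool = False
    parse_count: int = 0  # number of simulated metadata parses on this object

    def ensure_complete_metadata(self) -> None:
        # Skip the expensive step only when the flag says metadata is complete.
        if self.has_complete_metadata:
            return
        self.parse_count += 1  # stands in for reading/parsing the file footer
        self.metadata = {rg: {"num_rows": 100} for rg in self.row_groups}
        self.has_complete_metadata = True

    def split_by_row_group(self, propagate_flag: bool) -> List["ToyFragment"]:
        # The parent completes its metadata once, then hands each child
        # the slice of metadata for its single row group.
        self.ensure_complete_metadata()
        return [
            ToyFragment(
                path=self.path,
                row_groups=[rg],
                metadata={rg: self.metadata[rg]},
                # propagate_flag=False models the reported behaviour:
                # the child holds metadata but is not marked complete.
                has_complete_metadata=propagate_flag,
            )
            for rg in self.row_groups
        ]


def count_reparses(propagate_flag: bool) -> int:
    """Total simulated parses triggered on the children after a split."""
    parent = ToyFragment("toy.parquet", row_groups=[0, 1, 2])
    children = parent.split_by_row_group(propagate_flag)
    for child in children:
        child.ensure_complete_metadata()
    return sum(child.parse_count for child in children)
```

In this toy model, count_reparses(False) triggers one parse per child, while count_reparses(True) triggers none, which mirrors the gap between the first and second timed calls in the session above.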