[ARROW-10130] [C++][Dataset] ParquetFileFragment::SplitByRowGroup does not preserve "complete_metadata" status - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: C++
Labels:

External issue URL:
https://github.com/apache/arrow/issues/26142

Description

Splitting a ParquetFileFragment into multiple fragments, one per row group (SplitByRowGroup), calls EnsureCompleteMetadata up front, but the resulting fragments do not have the has_complete_metadata_ property set. As a result, calling EnsureCompleteMetadata on the split fragments reads and parses the metadata again instead of using the cached metadata, which is already present.

Small example to illustrate:

In [1]: import pyarrow.dataset as ds

In [2]: dataset = ds.parquet_dataset("nyc-taxi-data/dask-partitioned/_metadata", partitioning="hive")

In [3]: rg_fragments = [rg for frag in dataset.get_fragments() for rg in frag.split_by_row_group()]

In [4]: len(rg_fragments)
Out[4]: 4520

# row group fragments actually have statistics
In [7]: rg_fragments[0].row_groups[0].statistics
Out[7]: 
{'vendor_id': {'min': '1', 'max': '4'},
 'pickup_at': {'min': datetime.datetime(2009, 1, 1, 0, 5, 51),
  'max': datetime.datetime(2018, 12, 26, 14, 48, 54)},
...

# but the first call to ensure_complete_metadata on each fragment still takes a lot of time
In [8]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
CPU times: user 1.72 s, sys: 203 ms, total: 1.92 s
Wall time: 1.9 s

In [9]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
CPU times: user 1.34 ms, sys: 0 ns, total: 1.34 ms
Wall time: 1.35 ms
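
The pattern described above can be modelled with a small toy class. This is only a sketch of the caching logic, not Arrow's implementation: ToyFragment, parse_count, preserve_status and count_reparses are hypothetical names introduced for illustration, and the "metadata" is a dummy dict. It shows that when the split does not carry the "complete" flag over, every child fragment repeats the parse even though it already holds the metadata.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyFragment:
    """Hypothetical stand-in for a Parquet file fragment (not Arrow's API)."""

    row_groups: List[int]
    metadata: Optional[dict] = None
    has_complete_metadata: bool = False
    parse_count: int = 0  # how many times the "expensive" parse ran

    def ensure_complete_metadata(self) -> None:
        # Cheap path: the flag says the metadata is already complete.
        if self.has_complete_metadata:
            return
        # Expensive path (simulated): read and parse the metadata.
        self.parse_count += 1
        self.metadata = {rg: {"num_rows": 100} for rg in self.row_groups}
        self.has_complete_metadata = True

    def split_by_row_group(self, preserve_status: bool) -> List["ToyFragment"]:
        # The parent always completes its own metadata first.
        self.ensure_complete_metadata()
        return [
            ToyFragment(
                row_groups=[rg],
                # The child receives the already-parsed metadata...
                metadata={rg: self.metadata[rg]},
                # ...but only skips re-parsing if the flag is carried over.
                has_complete_metadata=preserve_status,
            )
            for rg in self.row_groups
        ]


def count_reparses(preserve_status: bool) -> int:
    """Split a three-row-group fragment and count child re-parses."""
    parent = ToyFragment(row_groups=[0, 1, 2])
    children = parent.split_by_row_group(preserve_status)
    for child in children:
        child.ensure_complete_metadata()
    return sum(child.parse_count for child in children)
```

In this model, count_reparses(False) corresponds to the behaviour reported here (one redundant parse per row group fragment), while count_reparses(True) corresponds to the expected behaviour, where the first call on each split fragment is already cheap.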

Issue Links

links to: GitHub Pull Request #8298

People

Assignee: Joris Van den Bossche
Reporter: Joris Van den Bossche
Votes: 0
Watchers: 2

Dates

Created: 29/Sep/20 11:49
Updated: 11/Jan/23 08:11
Resolved: 30/Sep/20 08:29

