Apache Arrow / ARROW-10130

[C++][Dataset] ParquetFileFragment::SplitByRowGroup does not preserve "complete_metadata" status

Details

    Description

      Splitting a ParquetFileFragment into one fragment per row group (SplitByRowGroup) first calls EnsureCompleteMetadata on the parent, but the resulting fragments do not have the has_complete_metadata_ property set. As a result, calling EnsureCompleteMetadata on a split fragment reads and parses the metadata again instead of using the cached metadata, which is already present.

      Small example to illustrate:

      In [1]: import pyarrow.dataset as ds
      
      In [2]: dataset = ds.parquet_dataset("nyc-taxi-data/dask-partitioned/_metadata", partitioning="hive")
      
      In [3]: rg_fragments = [rg for frag in dataset.get_fragments() for rg in frag.split_by_row_group()]
      
      In [4]: len(rg_fragments)
      Out[4]: 4520
      
      # row group fragments actually have statistics
      In [7]: rg_fragments[0].row_groups[0].statistics
      Out[7]: 
      {'vendor_id': {'min': '1', 'max': '4'},
       'pickup_at': {'min': datetime.datetime(2009, 1, 1, 0, 5, 51),
        'max': datetime.datetime(2018, 12, 26, 14, 48, 54)},
      ...
      
      # but the first call to ensure_complete_metadata is still slow
      In [8]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
      CPU times: user 1.72 s, sys: 203 ms, total: 1.92 s
      Wall time: 1.9 s
      
      In [9]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
      CPU times: user 1.34 ms, sys: 0 ns, total: 1.34 ms
      Wall time: 1.35 ms
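
      The behaviour described above can be modelled with a small, self-contained sketch. It is not Arrow's implementation: ToyFragment, parse_footer, and the propagate_flag switch are hypothetical names introduced only to illustrate how a "complete metadata" flag that is not carried over to the split fragments forces one extra metadata parse per row group fragment.

```python
from dataclasses import dataclass
from typing import List, Optional

PARSE_CALLS = 0  # counts how often the simulated footer parse runs


def parse_footer(path: str, num_row_groups: int) -> List[dict]:
    """Hypothetical stand-in for an expensive Parquet footer read."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    return [{"path": path, "row_group": i} for i in range(num_row_groups)]


@dataclass
class ToyFragment:
    """Toy model of a file fragment; not the pyarrow class."""
    path: str
    num_row_groups: int
    row_groups: Optional[List[dict]] = None
    has_complete_metadata: bool = False

    def ensure_complete_metadata(self) -> None:
        # Cheap when the flag is set, otherwise parse the footer again.
        if self.has_complete_metadata:
            return
        all_groups = parse_footer(self.path, self.num_row_groups)
        if self.row_groups is None:
            self.row_groups = all_groups
        self.has_complete_metadata = True

    def split_by_row_group(self, propagate_flag: bool) -> List["ToyFragment"]:
        # The parent is completed first, as in the issue description.
        self.ensure_complete_metadata()
        return [
            ToyFragment(
                self.path,
                self.num_row_groups,
                row_groups=[rg],
                # propagate_flag=False models the reported behaviour,
                # propagate_flag=True models the expected behaviour.
                has_complete_metadata=propagate_flag,
            )
            for rg in self.row_groups
        ]


def count_parses(propagate_flag: bool) -> int:
    """Split a 4-row-group toy fragment and complete every child."""
    global PARSE_CALLS
    PARSE_CALLS = 0
    parent = ToyFragment("part-0.parquet", num_row_groups=4)
    for child in parent.split_by_row_group(propagate_flag):
        child.ensure_complete_metadata()
    return PARSE_CALLS


if __name__ == "__main__":
    print("flag not propagated:", count_parses(False), "parses")
    print("flag propagated:    ", count_parses(True), "parses")
```

      In this toy model, carrying the flag over to the children reduces the work to the single parse done on the parent, which mirrors the difference between the first and second timings above.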
      

            People

              jorisvandenbossche Joris Van den Bossche
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h
