Details
- Type: Improvement
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Fix Version: 8.0.0
Description
Hello!
I've noticed that writing a `_metadata` file with `pyarrow.parquet.write_metadata` is very slow when the `metadata_collector` is large, exhibiting O(n^2) behavior. The bottleneck appears to be the concatenation inside `metadata.append_row_groups`: the writer iterates over every item of the list and concatenates it onto the accumulated metadata on each iteration, and the cost of each concatenation appears to grow with the size of what has already been accumulated.
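For context, the `metadata_collector` handling in `write_metadata` boils down to a loop like the following (a simplified sketch, not the exact source; `where` stands for the output file):

```python
import pyarrow.parquet as pq

# Simplified sketch of what write_metadata does with metadata_collector
# (not the exact pyarrow source).
metadata = pq.read_metadata(where)        # metadata containing only the schema
for m in metadata_collector:
    # The cost of each call appears to grow with the size of the already
    # accumulated `metadata`, so the whole loop is O(n^2).
    metadata.append_row_groups(m)
metadata.write_metadata_file(where)
```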
Would it be possible to provide a vectorized implementation, where `append_row_groups` accepts a list of `FileMetaData` objects and the concatenation happens only once?
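From the caller's side, that might look like this (purely hypothetical API sketch; this overload does not exist today):

```python
# Hypothetical overload: pass the whole list so the concatenation
# happens once, on the C++ side, instead of once per element.
metadata = pq.read_metadata(where)
metadata.append_row_groups(metadata_collector)  # list of FileMetaData
metadata.write_metadata_file(where)
```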
Repro (in IPython, to use `%time`). Doubling the size of `metadata_collector` roughly quadruples the runtime:
```python
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]


schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s
```
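Until something like that exists, one possible interim workaround on the user side is to merge the collected `FileMetaData` objects pairwise in a balanced tree before calling `write_metadata`, so each row group is copied O(log n) times instead of O(n). This is only a sketch, under the assumption that the per-call cost of `append_row_groups` grows with the size of the receiver; `per_file_metadata` and `schema` are placeholders for whatever a real writer job collected, and the objects are mutated in place:

```python
import pyarrow.parquet as pq


def merge_metadata_balanced(metas):
    """Merge FileMetaData objects pairwise (tree reduction).

    WARNING: append_row_groups mutates its receiver, so the elements of
    `metas` are modified in place. Assumes distinct FileMetaData instances.
    """
    metas = list(metas)
    while len(metas) > 1:
        merged = []
        for i in range(0, len(metas) - 1, 2):
            metas[i].append_row_groups(metas[i + 1])
            merged.append(metas[i])
        if len(metas) % 2 == 1:  # carry the odd element to the next round
            merged.append(metas[-1])
        metas = merged
    return metas[0]


# `per_file_metadata` and `schema` are placeholders for metadata collected
# from real write_table calls. Passing a single pre-merged object means the
# internal loop in write_metadata only performs one append.
combined = merge_metadata_balanced(per_file_metadata)
pq.write_metadata(schema, "_metadata", metadata_collector=[combined])
```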