Apache Arrow / ARROW-16613

[Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)


Description

      Hello!


      I've noticed that writing a `_metadata` file with `pyarrow.parquet.write_metadata` is very slow when the `metadata_collector` is large, exhibiting O(n^2) behavior. Specifically, the concatenation inside `metadata.append_row_groups` appears to be the bottleneck: the writer loops over every item in the collector and concatenates it into the accumulated metadata on each iteration, so each append gets progressively more expensive.
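
      As far as I can tell, the pattern is roughly the following (a simplified sketch of the current behavior as I understand it, not the exact pyarrow source; `naive_write_metadata` is just an illustrative name):

      import pyarrow.parquet as pq
      
      def naive_write_metadata(schema, where, metadata_collector):
          # Write a footer-only file for the schema, then fold every collected
          # FileMetaData into it one at a time. Each append_row_groups call
          # concatenates into the growing metadata object, so with n entries
          # the total work is roughly 1 + 2 + ... + n, i.e. O(n^2).
          pq.ParquetWriter(where, schema).close()
          metadata = pq.read_metadata(where)
          for m in metadata_collector:
              metadata.append_row_groups(m)
          metadata.write_metadata_file(where)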


      Would it be possible to provide a vectorized implementation, where `append_row_groups` accepts a list of `FileMetaData` objects and the concatenation happens only once?
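
      For illustration, usage could then look something like this (hypothetical signature, not the current API):

      # Hypothetical sketch, not current pyarrow: append_row_groups takes the
      # whole list and concatenates all row groups in a single pass, so the
      # accumulated metadata is only built once.
      metadata = pq.read_metadata(where)
      metadata.append_row_groups(metadata_collector)  # list of FileMetaData
      metadata.write_metadata_file(where)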


      Repro (run in IPython so `%time` works):

      from io import BytesIO
      
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      # Write a tiny table to an in-memory buffer just to obtain a single
      # FileMetaData object to reuse below.
      def create_example_file_meta_data():
          data = {
              "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
              "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
              "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
              "bool": pa.array([True, True, False, False], type=pa.bool_()),
          }
          table = pa.table(data)
          metadata_collector = []
          pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
          return table.schema, metadata_collector[0]
      
      schema, meta = create_example_file_meta_data()
      
      # The same FileMetaData is duplicated to simulate a collector built from many files.
      metadata_collector = [meta] * 500
      %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
      # CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
      # Wall time: 234 ms
      
      metadata_collector = [meta] * 1000
      %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
      # CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
      # Wall time: 970 ms
      
      metadata_collector = [meta] * 2000
      %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
      # CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
      # Wall time: 4.3 s
      
      metadata_collector = [meta] * 4000
      %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
      # CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
      # Wall time: 17.3 s
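      # Each doubling of the collector size roughly quadruples the runtime,
      # consistent with O(n^2) scaling in len(metadata_collector).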
      

People

    Assignee: Antoine Pitrou (apitrou)
    Reporter: Kyle Barron (kylebarron2)