Apache Arrow / ARROW-11469

[Python] Performance degradation parquet reading of wide dataframes

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0, 1.0.1, 2.0.0, 3.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      I noticed a significant performance regression in pyarrow 1.0.0 and later when reading wide dataframes from Parquet.

      The following should reproduce it (run in IPython or Jupyter, since it uses the %timeit magic):

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      df = pd.DataFrame(np.random.rand(100, 10000))
      table = pa.Table.from_pandas(df)
      pq.write_table(table, "temp.parquet")
      
      %timeit pd.read_parquet("temp.parquet")

      With version 0.17.0 the read takes about 300-400 ms; with 1.0.0 and every later version I tried, it takes around 2 seconds.

      Thanks for looking into this.

      Attachments

        1. profile_wide300.svg
          57 kB
          Joris Van den Bossche
        2. image-2021-05-03-14-40-09-520.png
          130 kB
          Elena Henderson
        3. image-2021-05-03-14-39-59-485.png
          327 kB
          Elena Henderson
        4. image-2021-05-03-14-31-41-260.png
          298 kB
          Elena Henderson

            People

              Assignee: Unassigned
              Reporter: Axel G