  Apache Arrow / ARROW-6985

[Python] Steadily increasing time to load file using read_parquet

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.13.0, 0.14.0, 0.15.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      Reading a Parquet file with the pandas read_parquet function takes steadily longer with each invocation. I've seen the other ticket about memory usage, but I see no memory impact, just a steadily increasing read time until I restart the Python session.

      Below is some code to reproduce my results. The slowdown is particularly bad on wide matrices, especially with pyarrow==0.15.0.

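      import os
      import time

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      file = "skinny_matrix.pq"

      # Write the test file once: a mostly-zero frame with 26000 rows and
      # 6000 columns, where every 100th element holds a random value.
      if not os.path.isfile(file):
          mat = np.zeros((6000, 26000))
          mat.ravel()[::100] = np.random.randn(60 * 26000)
          df = pd.DataFrame(mat.T)
          table = pa.Table.from_pandas(df)
          pq.write_table(table, file)

      # Read the same file repeatedly and record the wall-clock time per read.
      n_timings = 50
      timings = np.empty(n_timings)
      for i in range(n_timings):
          start = time.time()
          new_df = pd.read_parquet(file)
          end = time.time()
          timings[i] = end - start

      # Per-read durations in seconds, in call order.
      print(timings)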
      
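      The claim of a "steadily increasing" read time could be checked numerically rather than by eye. The following is a minimal sketch, not part of the original report: the helper names time_calls and trend_slope are hypothetical, and the slope is an ordinary least-squares fit of duration against call index, using only the standard library.

```python
import time


def time_calls(fn, n_calls):
    """Return the wall-clock duration (seconds) of each of n_calls calls to fn."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def trend_slope(durations):
    """Least-squares slope of duration versus call index (seconds per call)."""
    n = len(durations)
    if n < 2:
        raise ValueError("need at least two timings to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(durations) / n
    # Covariance of (index, duration) divided by the variance of the index.
    cov = sum((i - mean_x) * (d - mean_y) for i, d in enumerate(durations))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var
```

      With the reproduction script above, this might be used as trend_slope(time_calls(lambda: pd.read_parquet(file), 50)). A slope that stays clearly positive across several fresh Python sessions would support the reported behaviour; a slope near zero would not.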

          People

            Assignee: Unassigned
            Reporter: Casey (CHDev93)
            Votes: 0
            Watchers: 5
