ARROW-6876: [Python] Reading parquet file with many columns becomes slow for 0.15.0

    Details

      Description

      Hi,

       

      After upgrading to 0.15.0, I noticed that reading a Parquet file through pandas has become much slower.

       

      Example:

      With 0.14.1
      In [4]: %timeit df = pd.read_parquet(path)
      2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

      With 0.15.0
      In [5]: %timeit df = pd.read_parquet(path)
      22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

       

      The file is about 15 MB in size. Both measurements were taken on the same machine with the same versions of Python and pandas.
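
      To repeat the comparison outside IPython (for example in a script run once per installed version), the measurement can be approximated with the standard library. This is a minimal sketch, not the reporter's setup: `time_call` and `format_timing` are hypothetical helper names, and the output only imitates the layout of the `%timeit` lines above.

```python
import statistics
import timeit


def time_call(func, runs=7):
    """Call func() once per run and return (mean, std_dev) in seconds."""
    # repeat=runs with number=1 mirrors "7 runs, 1 loop each" from %timeit.
    samples = timeit.repeat(func, repeat=runs, number=1)
    return statistics.mean(samples), statistics.pstdev(samples)


def format_timing(mean, std_dev, runs=7):
    """Render a timing in a layout similar to the %timeit summary line."""
    return (f"{mean:.3g} s ± {std_dev:.3g} s per loop "
            f"(mean ± std. dev. of {runs} runs, 1 loop each)")


# Hypothetical usage, assuming pandas and pyarrow are installed and `path`
# points at the affected file:
#   import pandas as pd
#   mean, std_dev = time_call(lambda: pd.read_parquet(path))
#   print(format_timing(mean, std_dev))
```

      Running the same script under 0.14.1 and under 0.15.0 gives two numbers that can be compared directly.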

       

      Have you received similar complaints? What could be the issue here?

       

      Thanks a lot.

       

       

      Edit1:

      Some profiling I did:

      0.14.1:

       

      0.15.0:

       
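
      A text profile of the read can be captured with the standard-library profiler, which makes the two versions easy to compare side by side. This is a sketch under assumptions: `profile_call` is a hypothetical helper, and it is not necessarily the tool used for the profiling mentioned above.

```python
import cProfile
import io
import pstats


def profile_call(func, top=15):
    """Run func() under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)

    # Collect the report in memory instead of printing it to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical usage, assuming pandas and pyarrow are installed and `path`
# points at the affected file:
#   import pandas as pd
#   df, report = profile_call(lambda: pd.read_parquet(path))
#   print(report)
```

      Note that cProfile only attributes time to Python-level calls, so time spent inside native code shows up under the Python function that called into it.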

              People

              • Assignee: Wes McKinney (wesm)
              • Reporter: Bob (dorafmon)
              • Votes: 0
              • Watchers: 6

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 40m