Apache Arrow / ARROW-6876

[Python] Reading parquet file with many columns becomes slow for 0.15.0

    Description

      Hi,

       

      I just noticed that reading a parquet file with pandas became much slower (roughly 11x in the example below) after I upgraded to 0.15.0.

       

      Example:

      With 0.14.1
      In [4]: %timeit df = pd.read_parquet(path)
      2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

      With 0.15.0
      In [5]: %timeit df = pd.read_parquet(path)
      22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

       

      The file is about 15 MB. Both timings were taken on the same machine, with the same versions of Python and pandas.

       

      Have you received similar complaints? What could be the issue here?

       

      Thanks a lot.

       

       

      Edit1:

      Some profiling I did:

      0.14.1:

       

      0.15.0:
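
      One way to collect a comparable profile on each version is the standard-library cProfile module. This is only a sketch: `profile_call` is a hypothetical helper, and the tool used for the profiles above is not stated.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=15, **kwargs):
    """Run fn(*args, **kwargs) under cProfile.

    Returns (result, report), where report is a text table of the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()
```

      For example, `_, report = profile_call(pd.read_parquet, path)` followed by `print(report)` lists the calls that dominate the read, which can then be compared between 0.14.1 and 0.15.0.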

       

            People

              wesm Wes McKinney
              dorafmon Bob
              Votes: 0
              Watchers: 8

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 40m