Apache Arrow / ARROW-7556

[Python][Parquet] Performance regression in pyarrow-0.15.1 vs pyarrow-0.12.1 when reading a "partitioned parquet table"?

Description

      Hi,

      I am currently running a small test with pyarrow to load 2 partitioned parquet tables.

      Loading seems to be about 2x slower with pyarrow-0.15.1 than with pyarrow-0.12.1.

      In my test:

      • the parquet tables I am loading are called 'reports' & 'memory'
      • they have been generated through pandas.to_parquet by specifying the partition columns (see the sketch after this list)
      • they are both partitioned by 2 columns, 'p_type' and 'p_start'
      • they are small tables:
        • reports
          • 90 partitions (1 parquet file per partition)
          • total size: 6.2 MB
        • memory
          • 105 partitions (1 parquet file per partition)
          • total size: 9.1 MB
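
      For reference, the layout was produced with something like the sketch below. The non-partition column names and values are simplified for illustration; only the partition_cols argument matches the real generation code:

      import numpy as np
      import pandas as pd

      # Hypothetical sample frame; the real tables have other columns.
      df = pd.DataFrame({
          'p_type': ['a', 'b'] * 50,
          'p_start': ['201912230000', '202001080000'] * 50,
          'value': np.random.rand(100),
      })

      # pandas delegates to pyarrow and writes one directory level per
      # partition column (hive-style: p_type=a/p_start=201912230000/...).
      df.to_parquet('parquet/reports', engine='pyarrow',
                    partition_cols=['p_type', 'p_start'])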

       

      Here is the code of my simple test that reads them (it filters on the p_start partition column):
      
      import os
      import sys
      import time

      import pyarrow
      from pyarrow.parquet import ParquetDataset

      def load_dataframe(data_dir, table, start_date, end_date):
          # Read only the partitions whose p_start falls in the requested
          # range, then convert the Arrow table to a pandas DataFrame.
          return ParquetDataset(os.path.join(data_dir, table),
                                filters=[('p_start', '>=', start_date),
                                         ('p_start', '<=', end_date)
                                         ]).read().to_pandas()

      print(f'pyarrow version;{pyarrow.__version__}')

      data_dir = sys.argv[1]

      # Load both tables 9 times and print the wall-clock time of each pass.
      for i in range(1, 10):
          start = time.time()
          start_date = '201912230000'
          end_date = '202001080000'
          load_dataframe(data_dir, 'reports', start_date, end_date)
          load_dataframe(data_dir, 'memory', start_date, end_date)
          print(f'loaded;in;{time.time()-start}')

      Here are the results:

      • with pyarrow-0.12.1

      $ python -m cProfile -o load_pyarrow_0.12.1.cprof load_df_from_pyarrow.py parquet/

      pyarrow version;0.12.1

      loaded;in;0.5566098690032959
      loaded;in;0.32605648040771484
      loaded;in;0.28951501846313477
      loaded;in;0.29279112815856934
      loaded;in;0.3474299907684326
      loaded;in;0.4075736999511719
      loaded;in;0.425199031829834
      loaded;in;0.34653329849243164
      loaded;in;0.300839900970459

      (~350 ms to load the 2 tables)

      • with pyarrow-0.15.1

      $ python -m cProfile -o load_pyarrow_0.15.1.cprof load_df_from_pyarrow.py parquet/

      pyarrow version;0.15.1
      loaded;in;1.1126022338867188
      loaded;in;0.8931224346160889
      loaded;in;1.3298325538635254
      loaded;in;0.8584625720977783
      loaded;in;0.9232609272003174
      loaded;in;1.0619215965270996
      loaded;in;0.8619768619537354
      loaded;in;0.8686420917510986
      loaded;in;1.1183602809906006

      (>800 ms to load the 2 tables)

      Is there a performance regression here?

      Am I missing something?
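
      If it helps, here is a variant of the test that times each phase separately (same call path as above; only the timing instrumentation is new), to show whether the extra time is spent in the ParquetDataset discovery/validation step, in the read itself, or in the pandas conversion:

      import os
      import time

      from pyarrow.parquet import ParquetDataset

      def timed_load(data_dir, table, start_date, end_date):
          # Phase 1: dataset discovery (directory walk, metadata collection,
          # partition filtering).
          t0 = time.time()
          dataset = ParquetDataset(os.path.join(data_dir, table),
                                   filters=[('p_start', '>=', start_date),
                                            ('p_start', '<=', end_date)])
          # Phase 2: reading the matching pieces into an Arrow table.
          t1 = time.time()
          arrow_table = dataset.read()
          # Phase 3: conversion to pandas.
          t2 = time.time()
          arrow_table.to_pandas()
          t3 = time.time()
          print(f'{table};discovery;{t1 - t0};read;{t2 - t1};to_pandas;{t3 - t2}')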

       

      You can find the 2 .cprof profiles in the attachments:

      load_pyarrow_0.12.1.cprof

      load_pyarrow_0.15.1.cprof
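
      They can be inspected with the standard pstats module, for example to list the functions with the highest cumulative time:

      import pstats

      # Print the top 15 entries of each attached profile, sorted by
      # cumulative time, to see where the extra time is spent.
      for prof in ('load_pyarrow_0.12.1.cprof', 'load_pyarrow_0.15.1.cprof'):
          print(prof)
          pstats.Stats(prof).sort_stats('cumulative').print_stats(15)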

      Attachments

        1. arrow-7756.tar.gz
          8.65 MB
          Julien MASSIOT
        2. load_df_from_pyarrow.py
          0.7 kB
          Julien MASSIOT
        3. load_df_pyarrow-0.12.1.png
          166 kB
          Julien MASSIOT
        4. load_df_pyarrow-0-15.1.png
          167 kB
          Julien MASSIOT
        5. load_pyarrow_0.12.1.cprof
          423 kB
          Julien MASSIOT
        6. load_pyarrow_0.15.1.cprof
          420 kB
          Julien MASSIOT


          People

            Assignee: Unassigned
            Reporter: Julien MASSIOT (jmassiot)
