Apache Arrow / ARROW-13369

[C++][Python] performance of read_table using filters on a partitioned parquet file


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: C++, Python
    • Labels: None

    Description

      Reading a single partition of a partitioned parquet dataset via the filters argument of read_table is significantly slower than reading that partition's directory directly.

      import pandas as pd
      import pyarrow.parquet

      size = 100_000
      df = pd.DataFrame({'a': [1, 2, 3] * size, 'b': [4, 5, 6] * size})
      # write a hive-partitioned dataset: test.parquet/a=1/, a=2/, a=3/
      df.to_parquet('test.parquet', partition_cols=['a'])

      # read one partition directory directly vs. selecting it with a filter
      %timeit pyarrow.parquet.read_table('test.parquet/a=1')
      %timeit pyarrow.parquet.read_table('test.parquet', filters=[('a', '=', 1)])
      

      gives the timings

      2.57 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      5.18 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 

      Likewise, changing size to 1_000_000 in the above code gives

      16.3 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      32.7 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
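
      To check whether the extra time comes from files outside the selected partition being read at all, the fragments chosen for the filter can be listed through the pyarrow.dataset API. A minimal sketch, not part of the original report, assuming the hive-style directory layout that to_parquet writes:

      import pyarrow.dataset as ds

      dataset = ds.dataset('test.parquet', format='parquet', partitioning='hive')
      # Fragments whose partition expression cannot match the filter are pruned,
      # so this should list only the file(s) under test.parquet/a=1/
      fragments = list(dataset.get_fragments(filter=ds.field('a') == 1))
      print([frag.path for frag in fragments])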

      Part of the docs for read_table states:

      > Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows.

      From this, I expected the performance to be roughly the same. 
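
      For reference, filtered reads on the non-legacy path go through the pyarrow.dataset layer, so a roughly equivalent filtered read written directly against that API (again assuming hive partitioning; this snippet is illustrative, not from the original report) is:

      import pyarrow.dataset as ds

      dataset = ds.dataset('test.parquet', format='parquet', partitioning='hive')
      # roughly equivalent to read_table('test.parquet', filters=[('a', '=', 1)])
      # on the dataset-based code path
      table = dataset.to_table(filter=ds.field('a') == 1)

      Timing this against the direct partition read may help narrow down whether the overhead sits in dataset discovery or in the filtered scan itself.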

    Attachments

    Activity

    People

      Assignee: Unassigned
      Reporter: Richard Shadrach
      Votes: 0
      Watchers: 2
