Project: Apache Arrow
Issue: ARROW-17913

feather.read_table 150x slower when reading columns in newer versions

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.0.0, 8.0.0, 9.0.0
    • None
    • None
    • Environment: Python 3.9, Ubuntu 20.04

    Description

      Passing the columns argument to feather.read_table on Arrow 7.0.0 through 9.0.0 is drastically slower than it was on 6.0.0: about 150x in the benchmarks below (33.8 ms versus 224 µs per call).

      Profiling the code below shows that the bottleneck is somewhere in the read_names function of pyarrow._feather.FeatherReader.
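
For reference, a profile of this kind can be collected with the standard-library cProfile module. The sketch below is not the reporter's profiling script: the helper names profile_call and profile_column_read are hypothetical, and it assumes test.feather has already been written by the setup code in the Example section, with columns named c0 through c9.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=10, **kwargs):
    """Run fn(*args, **kwargs) under cProfile and return (result, report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def profile_column_read(path="test.feather", columns=None):
    """Profile feather.read_table with an explicit column list."""
    # Imported lazily so profile_call stays usable without pyarrow installed.
    from pyarrow import feather

    if columns is None:
        # Assumption: column names match the issue's setup code (c0 .. c9).
        columns = [f"c{i}" for i in range(10)]
    _, report = profile_call(
        feather.read_table, path, columns=columns, memory_map=True
    )
    return report
```

Calling print(profile_column_read()) after running the setup code prints the ten most expensive entries. If the diagnosis above holds, the read_names method of FeatherReader should account for most of the cumulative time on the affected versions.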

      Example

      Setup code:

      import pandas as pd
      from pyarrow import feather
      
      rows, cols = (1_000_000, 10)
      data = {f'c{c}': range(rows) for c in range(cols)}
      df = pd.DataFrame(data=data)
      
      feather.write_feather(df, 'test.feather', compression="uncompressed")

      Benchmarks Arrow 9.0.0:

      %timeit feather.read_table('test.feather', memory_map=True)
      %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
      
      > 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
      > 33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
      

      Benchmarks Arrow 6.0.0:

      %timeit feather.read_table('test.feather', memory_map=True)
      %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
      
      > 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
      > 224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
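
The %timeit lines above need IPython. A rough way to repeat the comparison from a plain script is sketched below with the standard-library timeit module. The helper names (per_call_seconds, slowdown, compare_reads) are hypothetical, the file is assumed to be the test.feather written by the setup code, and the helper reports the best run rather than the mean ± std. dev. that %timeit prints, so the numbers will not match exactly.

```python
import timeit


def per_call_seconds(fn, number=10, repeat=7):
    """Return the best per-call wall time of fn() over `repeat` runs."""
    timings = timeit.repeat(fn, number=number, repeat=repeat)
    return min(timings) / number


def slowdown(selected_seconds, baseline_seconds):
    """Ratio of the column-selecting read to the baseline read."""
    return selected_seconds / baseline_seconds


def compare_reads(path="test.feather", columns=None):
    """Time read_table with and without an explicit column list."""
    # Imported lazily so the helpers above work without pyarrow installed.
    from pyarrow import feather

    if columns is None:
        # Assumption: column names match the issue's setup code (c0 .. c9).
        columns = [f"c{i}" for i in range(10)]
    full = per_call_seconds(lambda: feather.read_table(path, memory_map=True))
    selected = per_call_seconds(
        lambda: feather.read_table(path, columns=columns, memory_map=True)
    )
    return full, selected, slowdown(selected, full)


# Ratio implied by the figures reported above for the columns= call:
# 33.8 ms on 9.0.0 versus 224 µs on 6.0.0.
reported_ratio = slowdown(33.8e-3, 224e-6)
print(f"reported slowdown: {reported_ratio:.0f}x")  # prints: reported slowdown: 151x
```

Running compare_reads() under each installed pyarrow version gives the two per-call times and their ratio within that version, which can then be compared across versions.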
      

            People

              Assignee: Unassigned
              Reporter: Håkon Magne Holmen (hakonmh)
              Votes: 0
              Watchers: 7
