Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Versions: 4.0.1, 5.0.0, 7.0.0
- Environment: macOS 12.1, MacBook Pro (Intel x86)
Description
`pyarrow.fs.S3FileSystem.open_input_file` and `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with pandas' `read_csv`.
```python
import pandas as pd
import time
from pyarrow.fs import S3FileSystem


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = S3FileSystem(
        anonymous=True,
        region="us-east-2",
        endpoint_override=None,
        proxy_options=None,
    )
    print("Time to create fs: ", time.time() - t0)

    t0 = time.time()
    # fhandler = fs.open_input_stream(
    #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    # )
    fhandler = fs.open_input_file(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)

    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```
Running...
Time to create fs:  0.0003612041473388672
Time to create fhandler:  0.22461509704589844
read time: 105.76488208770752
total time: 105.99135684967041
```
This is with `pandas==1.4.2`.
I get similar performance with `fs.open_input_stream` as well (commented out in the code above):
```
Running...
Time to create fs:  0.0002570152282714844
Time to create fhandler:  0.18540692329406738
read time: 186.8419930934906
total time: 187.03169012069702
```
When running it with just pandas (which uses `s3fs` under the hood), it's much faster:
```python
import pandas as pd
import time


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    year_2016_df = pd.read_csv(
        "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```
Running...
read time: 1.1012001037597656
total time: 1.101264238357544
```
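For reference, the pandas call above is roughly equivalent to opening the file with `s3fs` explicitly and handing the file object to `read_csv`. This is a minimal sketch added for comparison, not the exact pandas code path; `anon=True` is an assumption for reading the public bucket without credentials:

```python
import pandas as pd
import s3fs

# Roughly what pandas does internally for an "s3://..." URL (sketch only).
# anon=True assumes anonymous access works for this public bucket.
fs = s3fs.S3FileSystem(anon=True)
with fs.open(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    "rb",
) as f:
    year_2016_df = pd.read_csv(f, nrows=100)
```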
Surprisingly, when we wrap the same `S3FileSystem` in `fsspec`'s `ArrowFSWrapper`, it matches the s3fs performance:
```python
import pandas as pd
import time
from pyarrow.fs import S3FileSystem
from fsspec.implementations.arrow import ArrowFSWrapper


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = ArrowFSWrapper(
        S3FileSystem(
            anonymous=True,
            region="us-east-2",
            endpoint_override=None,
            proxy_options=None,
        )
    )
    print("Time to create fs: ", time.time() - t0)

    t0 = time.time()
    fhandler = fs._open(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)

    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```
Running...
Time to create fs:  0.0002467632293701172
Time to create fhandler:  0.1858382225036621
read time: 0.13701486587524414
total time: 0.3232450485229492
```
Packages:
```
pyarrow: 7.0.0
pandas : 1.4.2
numpy  : 1.20.3
```
I tested with pyarrow 4.0.1 and 5.0.0 as well and saw similar results.
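One possible mitigation, as a minimal untested sketch that is not part of the original report: `open_input_stream` accepts a `buffer_size` argument, so requesting a buffered stream may keep pandas' many small `read()` calls from each turning into a separate S3 request. The 4 MiB buffer size below is an arbitrary assumption:

```python
import time

import pandas as pd
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-east-2")

# buffer_size asks Arrow to wrap the raw S3 stream in a buffered reader,
# so small reads are served from memory. 4 MiB is an arbitrary choice.
fhandler = fs.open_input_stream(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    buffer_size=4 * 1024 * 1024,
)

t0 = time.time()
year_2016_df = pd.read_csv(fhandler, nrows=100)
print("read time:", time.time() - t0)
```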