Apache Arrow: ARROW-16272

[C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`

Details

    Description

      `pyarrow.fs.S3FileSystem.open_input_file` and `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with pandas' `read_csv`.

      import pandas as pd
      import time
      from pyarrow.fs import S3FileSystem
      
      def load_parking_tickets():
          print("Running...")
          t0 = time.time()
          fs = S3FileSystem(
              anonymous=True,
              region="us-east-2",
              endpoint_override=None,
              proxy_options=None,
          )
      
          print("Time to create fs: ", time.time() - t0)
          t0 = time.time()
          # fhandler = fs.open_input_stream(
          #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
          # )
          fhandler = fs.open_input_file(
              "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
          )
          print("Time to create fhandler: ", time.time() - t0)
          t0 = time.time()
          year_2016_df = pd.read_csv(
              fhandler,
              nrows=100,
          )
          print("read time:", time.time() - t0)
          return year_2016_df
      
      t0 = time.time()
      load_parking_tickets()
      print("total time:", time.time() - t0)
      

      Output:

      Running...
      Time to create fs:  0.0003612041473388672
      Time to create fhandler:  0.22461509704589844
      read time: 105.76488208770752
      total time: 105.99135684967041
      

      This is with `pandas==1.4.2`.

      I get similar performance with `fs.open_input_stream` (commented out in the code above):

      Running...
      Time to create fs:  0.0002570152282714844
      Time to create fhandler:  0.18540692329406738
      read time: 186.8419930934906
      total time: 187.03169012069702
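
The repeated `t0 = time.time()` bookkeeping in these scripts could be factored into a small helper. The following is only a sketch: `timed` is a hypothetical name, not part of pandas or pyarrow, and it uses `time.perf_counter()` (a monotonic clock) instead of `time.time()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print, and optionally record, the wall-clock time spent in the block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the raw number for later comparison
        print(f"{label}: {elapsed:.4f} s")


# Usage: wrap each phase that the scripts time by hand.
timings = {}
with timed("read time", timings):
    sum(range(1000))  # stand-in for pd.read_csv(fhandler, nrows=100)
```

Each phase (creating the filesystem, opening the handle, reading) would get its own `with timed(...)` block.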
      

      When running it with just pandas (which uses `s3fs` under the hood), it's much faster:

      import pandas as pd
      import time
      
      def load_parking_tickets():
          print("Running...")
          t0 = time.time()
          year_2016_df = pd.read_csv(
              "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
              nrows=100,
          )
          print("read time:", time.time() - t0)
          return year_2016_df
      
      t0 = time.time()
      load_parking_tickets()
      print("total time:", time.time() - t0)
      

      Output:

      Running...
      read time: 1.1012001037597656
      total time: 1.101264238357544
      

      Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it's matches s3fs performance:

      import pandas as pd
      import time
      from pyarrow.fs import S3FileSystem
      from fsspec.implementations.arrow import ArrowFSWrapper
      
      def load_parking_tickets():
          print("Running...")
          t0 = time.time()
          fs = ArrowFSWrapper(
              S3FileSystem(
                  anonymous=True,
                  region="us-east-2",
                  endpoint_override=None,
                  proxy_options=None,
              )
          )
      
          print("Time to create fs: ", time.time() - t0)
          t0 = time.time()
          fhandler = fs._open(
              "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
          )
          print("Time to create fhandler: ", time.time() - t0)
          t0 = time.time()
          year_2016_df = pd.read_csv(
              fhandler,
              nrows=100,
          )
          print("read time:", time.time() - t0)
          return year_2016_df
      
      t0 = time.time()
      load_parking_tickets()
      print("total time:", time.time() - t0)
      

      Output:

      Running...
      Time to create fs:  0.0002467632293701172
      Time to create fhandler:  0.1858382225036621
      read time: 0.13701486587524414
      total time: 0.3232450485229492
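
To put the measurements side by side, the total times reported above can be turned into ratios. This is plain arithmetic on the numbers already shown. The labels and the `slowdown_vs` helper are made up for illustration, and each figure comes from a single run, so the ratios are only a rough indication.

```python
# Total wall-clock times in seconds, copied from the outputs above.
total_seconds = {
    "pyarrow open_input_file": 105.99135684967041,
    "pyarrow open_input_stream": 187.03169012069702,
    "pandas with s3fs": 1.101264238357544,
    "fsspec ArrowFSWrapper": 0.3232450485229492,
}


def slowdown_vs(baseline, timings):
    """Return each timing divided by the baseline timing."""
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


ratios = slowdown_vs("pandas with s3fs", total_seconds)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

With these numbers, the direct `open_input_file` path comes out at roughly 96 times the `s3fs` total, and the `ArrowFSWrapper` path at roughly 0.3 times.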
      

      Packages:

      pyarrow : 7.0.0
      pandas : 1.4.2
      numpy : 1.20.3
      

      I also tested with pyarrow 4.0.1 and 5.0.0 and saw similar results.
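
One way to narrow this down might be to record which read calls a consumer actually issues on the file handle. The sketch below is standard-library only and makes no claim about the cause. `ReadLogger` is a hypothetical helper, not part of pyarrow, pandas, or fsspec, and the demo uses an in-memory `io.BytesIO` with `csv.reader` as a stand-in for the S3 handle and `pd.read_csv`.

```python
import csv
import io


class ReadLogger:
    """Transparent proxy that records read()/read1() calls on a binary file object."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.calls = []  # tuples of (method name, positional args, bytes returned)

    def _record(self, name, args):
        data = getattr(self._wrapped, name)(*args)
        self.calls.append((name, args, len(data)))
        return data

    def read(self, *args):
        return self._record("read", args)

    def read1(self, *args):
        return self._record("read1", args)

    def __getattr__(self, name):
        # Everything else (seek, tell, close, closed, mode, ...) is forwarded.
        return getattr(self._wrapped, name)


def demo():
    """Read an in-memory CSV through the proxy and return what was recorded."""
    payload = b"a,b\n" + b"1,2\n" * 10_000
    logger = ReadLogger(io.BytesIO(payload))
    rows = list(csv.reader(io.TextIOWrapper(logger, encoding="utf-8")))
    return payload, rows, logger.calls


payload, rows, calls = demo()
for name, args, returned in calls:
    print(name, args, returned)
```

In the scripts above, the same proxy could wrap the object returned by `fs.open_input_file(...)` before it is passed to `pd.read_csv`. Comparing `calls` for the direct handle and for the `ArrowFSWrapper` handle would show whether the two paths request data in different sizes or through different methods. Whether pandas accepts the proxy unchanged may depend on the pandas version.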

          People

            apitrou Antoine Pitrou
            sahil1105 Sahil Gupta