Beam / BEAM-13585

Python SDK S3 reader 2x - 10x performance opportunity

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 2.34.0, 2.35.0, 2.36.0
    • Fix Version/s: 2.37.0
    • Component/s: sdk-py-core
    • Labels: None

    Description

      This is an "after-the-fact" Jira issue for my GitHub PR to make S3 streaming in the Python SDK vastly more efficient.

      The old implementation opened a new connection for every range request, which is inefficient for both the client and the server and adds a lot of unnecessary latency. The new implementation tries to reuse an existing connection and, where possible, continues reading from the same HTTP stream.
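The connection-reuse idea can be illustrated with a minimal sketch. This is not the code from the PR: `ReusingRangeReader`, `open_stream`, and `fake_open` are hypothetical names, and an in-memory `BLOB` stands in for the S3 object so the example runs without network access. It only shows the general pattern of continuing on an open stream when the next requested range starts where the previous one ended.

```python
import io
from typing import BinaryIO, Callable, Optional


class ReusingRangeReader:
    """Serves range reads from one open stream when reads are sequential.

    `open_stream(start)` is a hypothetical callable that opens a new
    connection and returns a file-like object positioned at byte `start`.
    """

    def __init__(self, open_stream: Callable[[int], BinaryIO]) -> None:
        self._open_stream = open_stream
        self._stream: Optional[BinaryIO] = None
        self._pos = 0  # offset of the next byte the open stream will yield
        self.connections_opened = 0

    def read_range(self, start: int, end: int) -> bytes:
        """Return bytes in [start, end), reusing the open stream if possible."""
        if self._stream is None or start != self._pos:
            # Non-contiguous request: drop the old stream and reconnect.
            if self._stream is not None:
                self._stream.close()
            self._stream = self._open_stream(start)
            self._pos = start
            self.connections_opened += 1
        data = self._stream.read(end - start)
        self._pos += len(data)
        return data


# Stand-in for a remote object (assumption for illustration). A real reader
# would issue an HTTP GET with a "Range: bytes=<start>-" header instead.
BLOB = bytes(range(256)) * 4


def fake_open(start: int) -> BinaryIO:
    return io.BytesIO(BLOB[start:])


reader = ReusingRangeReader(fake_open)
# Three back-to-back ranges are served from a single "connection".
chunks = [reader.read_range(i, i + 100) for i in range(0, 300, 100)]
```

In this sketch, sequential reads share one stream, and only a jump to a non-adjacent offset triggers a reconnect. A naive reader would open one connection per call to `read_range`.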

      Speed gain: 1.7-12x in benchmarks, more like 10x in real-world applications.

            People

              Assignee: Janek Bevendorff (phoerious)
              Reporter: Janek Bevendorff (phoerious)
              Votes: 0
              Watchers: 1

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m