[BEAM-13585] Python SDK S3 reader 2x - 10x performance opportunity - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: P2
Resolution: Fixed
Affects Version/s: 2.34.0, 2.35.0, 2.36.0
Fix Version/s: 2.37.0
Component/s: sdk-py-core
Labels:
None

Description

This is an "after-the-fact" Jira issue for my GitHub PR to make S3 streaming in the Python SDK vastly more efficient.

The old implementation opened a new connection for each range request, which is very inefficient for both the client and the server and adds a lot of unnecessary latency. The new implementation tries to reuse an existing connection and continues reading from the same HTTP stream where possible.
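
The idea can be illustrated with a minimal sketch. This is not the code from the pull request: `StreamReusingReader`, `open_range`, and `make_fake_opener` are hypothetical names, and an in-memory `io.BytesIO` stands in for the HTTP response body of a ranged S3 request. The sketch only shows the general pattern, assuming that a read starting exactly where the previous one ended can be served from the already open stream.

```python
import io
from typing import BinaryIO, Callable, Optional


class StreamReusingReader:
    """Serves ranged reads, reusing one open stream for sequential access."""

    def __init__(self, open_range: Callable[[int], BinaryIO]):
        # open_range(start) is a hypothetical callback that opens a stream
        # beginning at byte offset `start` (for S3, a ranged GET request).
        self._open_range = open_range
        self._stream: Optional[BinaryIO] = None
        self._pos = 0      # offset of the next byte the open stream will yield
        self.opens = 0     # number of streams opened, for illustration only

    def read(self, start: int, size: int) -> bytes:
        # Open a new stream only if none exists or the caller jumped elsewhere.
        if self._stream is None or start != self._pos:
            if self._stream is not None:
                self._stream.close()
            self._stream = self._open_range(start)
            self._pos = start
            self.opens += 1
        data = self._stream.read(size)
        self._pos += len(data)
        return data


def make_fake_opener(payload: bytes) -> Callable[[int], BinaryIO]:
    """Stand-in for a real ranged request: streams payload[start:]."""
    def open_range(start: int) -> BinaryIO:
        return io.BytesIO(payload[start:])
    return open_range


payload = bytes(range(256)) * 4
reader = StreamReusingReader(make_fake_opener(payload))

# Four sequential 100-byte reads share a single stream.
chunks = [reader.read(offset, 100) for offset in range(0, 400, 100)]
sequential_opens = reader.opens

# A non-sequential read forces a new stream to be opened.
head = reader.read(0, 10)
opens_after_seek = reader.opens
```

In this sketch the four sequential reads trigger one open, whereas one connection per range request would have needed four; only the jump back to offset 0 opens a second stream.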

Speed gain: 1.7-12x in benchmarks, closer to 10x in real-world applications.

Issue Links

links to

GitHub Pull Request #15931

People

Assignee: Janek Bevendorff

Reporter: Janek Bevendorff

Votes: 0

Watchers: 1

Dates

Created: 31/Dec/21 08:58

Updated: 11/Jan/22 13:35

Resolved: 11/Jan/22 13:34

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 10m