[BEAM-2572] Implement an S3 filesystem for Python SDK - ASF JIRA

Details

Type: Task
Status: Triage Needed
Priority: P3
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.19.0
Component/s: sdk-py-core
Labels:
- GSoC2019
- gsoc
- gsoc2019
- mentor
- outreachy19dec

Description

There are two paths worth exploring, to my understanding:

1. Sticking to the HDFS-based approach (like it's done in Java).
2. Using boto/boto3 for accessing S3 through its common API endpoints.

I personally prefer the second approach, for a few reasons:

1. In practice, HDFS and S3 have different consistency guarantees, so their behaviors may diverge in some edge cases (say, we write something to S3, but it's not immediately readable from another client).

2. There are other AWS-based sources and sinks we may want to create in the future: DynamoDB, Kinesis, SQS, etc.

3. boto3 already provides reasonably good logic for basic things like retrying failed requests.
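
To make the second approach more concrete, here is a minimal sketch of what a boto3-backed read/write helper could look like. The names `parse_s3_path`, `S3IO`, and `InMemoryClient` are hypothetical and are not part of Beam or boto3. The only boto3 calls assumed are `get_object` and `put_object` with their `Bucket`, `Key`, and `Body` parameters; the client is injected so the sketch runs without AWS access.

```python
import io
from typing import Tuple


def parse_s3_path(path: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not path.startswith(prefix):
        raise ValueError("not an S3 path: %r" % path)
    bucket, _, key = path[len(prefix):].partition("/")
    if not bucket:
        raise ValueError("missing bucket in %r" % path)
    return bucket, key


class S3IO:
    """Hypothetical helper (not Beam's API).

    `client` is expected to behave like `boto3.client("s3")`.
    """

    def __init__(self, client):
        self._client = client

    def read(self, path: str) -> bytes:
        bucket, key = parse_s3_path(path)
        response = self._client.get_object(Bucket=bucket, Key=key)
        return response["Body"].read()

    def write(self, path: str, data: bytes) -> None:
        bucket, key = parse_s3_path(path)
        self._client.put_object(Bucket=bucket, Key=key, Body=data)


class InMemoryClient:
    """Stand-in for a boto3 S3 client, used only to exercise the sketch."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = bytes(Body)
        return {}

    def get_object(self, Bucket, Key):
        # A real client returns a streaming body; BytesIO mimics its read().
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


if __name__ == "__main__":
    s3 = S3IO(InMemoryClient())
    s3.write("s3://my-bucket/dir/file.txt", b"hello")
    print(s3.read("s3://my-bucket/dir/file.txt"))
```

With real boto3, the injected client would presumably be `boto3.client("s3")`, and retry behavior could be tuned by passing a `botocore.config.Config(retries={"max_attempts": 5})` when creating it. The stand-in client is strongly consistent, so it does not reproduce the read-after-write edge case mentioned in point 1.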

Whichever path we choose, there's a related problem: we currently cannot pass any global settings (say, pipeline options, or just an arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the runner nodes with AWS keys in their environment, which is not trivial to achieve and doesn't look too clean either (I'd rather have a single place for configuring the runner options).
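
As an illustration of the configuration problem above, here is a small sketch of how a filesystem could resolve credentials if it were handed options explicitly, falling back to the environment only when they are absent. The option names `s3_access_key_id` and `s3_secret_access_key` and the function `resolve_credentials` are hypothetical; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS environment variable names.

```python
import os
from typing import Mapping, NamedTuple, Optional


class AwsCredentials(NamedTuple):
    access_key_id: str
    secret_access_key: str


def resolve_credentials(
    options: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> AwsCredentials:
    """Prefer explicitly passed options; fall back to environment variables."""
    env = os.environ if environ is None else environ
    key_id = options.get("s3_access_key_id") or env.get("AWS_ACCESS_KEY_ID")
    secret = (options.get("s3_secret_access_key")
              or env.get("AWS_SECRET_ACCESS_KEY"))
    if not key_id or not secret:
        raise ValueError("AWS credentials found in neither options nor environment")
    return AwsCredentials(key_id, secret)


if __name__ == "__main__":
    # Options win over the environment when both are present.
    creds = resolve_credentials(
        {"s3_access_key_id": "from-options", "s3_secret_access_key": "s1"},
        environ={"AWS_ACCESS_KEY_ID": "from-env", "AWS_SECRET_ACCESS_KEY": "s2"},
    )
    print(creds.access_key_id)
```

Today only the environment branch is available to a filesystem, which is why every runner node would need the keys configured separately.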

Also, it's worth mentioning that I already have a janky S3 filesystem implementation that only supports DirectRunner at the moment (because of the limitation described in the previous paragraph). I'm perfectly happy to finish it myself, with some guidance from the maintainers.

Where should I go from here, and whose input should I be looking for?

Thanks!

Issue Links

is related to

BEAM-2492 Have PipelineOptions DisplayData filter out attributes marked with @org.apache.beam.sdk.options.Hidden

Resolved

relates to

BEAM-9094 Support setting some options such as endpoint_url and credential infos for AWS S3 Filesystem in Python SDKs

Open

links to

GitHub Pull Request #9955

GitHub Pull Request #11260

People

Assignee: Unassigned

Reporter: Dmitry Demeshchuk

Votes: 6

Watchers: 19

Dates

Created: 07/Jul/17 21:05

Updated: 13/Apr/23 11:00

Resolved: 13/Jan/20 18:50

Time Tracking

Estimated: Not Specified

Remaining:

Logged: