Description
Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends. Beam recently added support for launching jobs using Yaml, in addition to its other SDKs. This project would focus on adding more features and transforms to the Yaml SDK so that it can be the easiest way to define data pipelines.
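For readers who have not seen the Yaml SDK, here is a minimal sketch of what a pipeline definition can look like, modeled on the style of the examples in the Yaml SDK docs linked below. The file paths and the column name `amount` are hypothetical placeholders, and the exact set of available transforms and config keys should be checked against the current docs.

```yaml
# Illustrative sketch only: paths and the `amount` column are hypothetical.
pipeline:
  # "chain" connects each transform to the one listed before it.
  type: chain
  transforms:
    # Read rows from one or more CSV files.
    - type: ReadFromCsv
      config:
        path: /path/to/input*.csv
    # Keep only the rows matching a Python expression.
    - type: Filter
      config:
        language: python
        keep: "amount > 100"
    # Write the surviving rows out as JSON.
    - type: WriteToJson
      config:
        path: /path/to/output.json
```

Each entry under `transforms` names an existing Beam transform and its configuration. The first objective below aims to make more transforms available in this form.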
Objectives:
1. Add support for existing Beam transforms (IOs, Machine Learning transforms, and others) to the Yaml SDK
2. Add end-to-end pipeline use cases using the Yaml SDK
3. (stretch) Add Yaml SDK support to the Beam Playground
Useful links:
Apache Beam repo - https://github.com/apache/beam
Yaml SDK code + docs - https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml
Open issues for the Yaml SDK - https://github.com/apache/beam/issues?q=is%3Aopen+is%3Aissue+label%3Ayaml