There is a desire for third party language extensions for Apache Spark. Some notable examples include:
- C#/F# from project Mobius https://github.com/Microsoft/Mobius
- Haskell from project sparkle https://github.com/tweag/sparkle
- Julia from project Spark.jl https://github.com/dfdx/Spark.jl
Presently, Apache Spark supports Python and R via a tightly integrated interop layer. Much of that existing layer could likely be refactored into a clean surface for general (third-party) language bindings, such as those listed above. More specifically, could we generalize the following modules:
- Deploy runners (e.g., PythonRunner and RRunner)
- DataFrame Executors
- RDD operations?
The last is questionable: integrating third-party language extensions at the RDD level may be too heavyweight and unnecessary given the preference for the DataFrame abstraction.
The main goals of this effort would be:
- Provide a clean abstraction for third-party language extensions, making each extension easier to maintain as Apache Spark evolves
- Provide guidance to third party language authors on how a language extension should be implemented
- Provide general reusable libraries that are not specific to any language extension
- Open the door to developers that prefer alternative languages
- Identify and clean up common code shared between Python and R interops
Data Scientists, Data Engineers, Library Developers
Data scientists and engineers will have the opportunity to work with Spark in languages other than what’s natively supported. Library developers will be able to create language extensions for Spark in a clean way. The interop layer should also provide guidance for developing language extensions.
The proposal does not aim to create an actual language extension. Rather, it aims to provide a stable interop layer for third party language extensions to dock.
Much of the work will involve generalizing the existing interop APIs for PySpark and R, specifically for the DataFrame API. For instance, it would be good to have a general deploy.Runner (similar to PythonRunner) for language extension efforts. In Spark SQL, it would be good to have a general InteropUDF and evaluator (similar to BatchEvalPythonExec).
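To make the idea of a general runner concrete, here is a minimal sketch of what a language-neutral runner abstraction could look like. It is written in Python purely for illustration; the real interop layer would live on the Spark core (JVM) side. The names InteropRunner and JuliaRunner and the environment variable INTEROP_GATEWAY_PORT are hypothetical and do not exist in Spark.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class InteropRunner(ABC):
    """Hypothetical language-neutral runner (not an existing Spark class).

    Shared logic (building the command line, exporting the gateway port)
    lives here; a language extension only names its interpreter.
    """

    def __init__(self, script: str, args: List[str]):
        self.script = script
        self.args = list(args)

    @abstractmethod
    def interpreter(self) -> str:
        """Executable that launches the guest-language driver."""

    def environment(self, gateway_port: int) -> Dict[str, str]:
        # Assumed convention: the guest process finds the JVM gateway
        # through an environment variable. The variable name is made up.
        return {"INTEROP_GATEWAY_PORT": str(gateway_port)}

    def command(self) -> List[str]:
        # Common to every language: interpreter, then script, then user args.
        return [self.interpreter(), self.script, *self.args]


class JuliaRunner(InteropRunner):
    """Hypothetical extension: only the language-specific part is supplied."""

    def interpreter(self) -> str:
        return "julia"
```

Under this design, a third-party extension implements one small method instead of copying or subclassing PythonRunner.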
Low-level RDD operations should not be needed in this initial offering; depending on the success of the interop layer and given sufficient demand, RDD interop could be added later. However, one open question is whether to support a subset of low-level functions that are core to ETL, e.g., transform.
The work would be broken down into two top-level phases:
Phase 1: Introduce general interop API for deploying a driver/application, running an interop UDF along with any other low-level transformations that aid with ETL.
Phase 2: Port existing Python and R language extensions to the new interop layer. This port should be confined to the Spark core side, and the protocols specific to Python and R should not change, e.g., Python should continue to use py4j as the protocol between the Python process and core Spark. The port itself should touch only a handful of files (for Python, examples include PythonRunner, BatchEvalPythonExec, and possibly PythonRDD) and will mostly involve refactoring common logic into abstract implementations and utilities.
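The kind of refactoring described here, pulling common logic into an abstract implementation, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not Spark's actual code: BatchEvalInteropExec and InProcessEval are hypothetical names, and the assumed behavior (group rows into batches, evaluate each batch, append the UDF result to each input row) is a simplification of what a batch evaluator does.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Row = Tuple[Any, ...]


class BatchEvalInteropExec(ABC):
    """Hypothetical shared base: batching and row reassembly are common,
    while the transport to the guest language is left abstract."""

    def __init__(self, batch_size: int = 100):
        if batch_size < 1:
            raise ValueError("batch_size must be positive")
        self.batch_size = batch_size

    def _batches(self, rows: Iterable[Row]) -> Iterator[List[Row]]:
        batch: List[Row] = []
        for row in rows:
            batch.append(row)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # flush the final, possibly smaller, batch
            yield batch

    @abstractmethod
    def evaluate_batch(self, batch: List[Row]) -> List[Any]:
        """Language-specific step: return one result per input row."""

    def execute(self, rows: Iterable[Row]) -> Iterator[Row]:
        for batch in self._batches(rows):
            results = self.evaluate_batch(batch)
            if len(results) != len(batch):
                raise RuntimeError("evaluator returned wrong number of results")
            for row, result in zip(batch, results):
                yield row + (result,)  # append the UDF output as a new column


class InProcessEval(BatchEvalInteropExec):
    """Stand-in for a real extension; calls the UDF in-process instead of
    shipping the batch to an external language runtime."""

    def __init__(self, udf: Callable[..., Any], batch_size: int = 100):
        super().__init__(batch_size)
        self.udf = udf
        self.batches_seen = 0

    def evaluate_batch(self, batch: List[Row]) -> List[Any]:
        self.batches_seen += 1
        return [self.udf(*row) for row in batch]
```

With this split, the Python and R ports would each supply only their own evaluate_batch transport, and a new language extension would do the same without touching the shared batching logic.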
The clear alternative is the status quo: developers who want to provide a third-party language extension to Spark do so directly, often by extending existing Python classes and overriding the portions relevant to the new extension. Not only is this unsound design (e.g., a JuliaRDD is not a PythonRDD, even though PythonRDD contains a lot of reusable code), but it also runs a great risk that future revisions will make the subclass implementation obsolete. It is hard to imagine any third-party language extension succeeding without something in place to guarantee its long-term maintainability.
Another alternative is for third-party languages to interact with Spark only via pure SQL, possibly over REST. However, this does not enable UDFs written in the third-party language, which is a key requirement of this effort. Most notably, legacy code and UDFs would otherwise need to be ported to a supported language, e.g., Scala. This exercise is extremely cumbersome and not always feasible, because the source code may no longer be available, i.e., only the compiled library exists.