Description
Motivation
Currently the Scala sql/core and Connect clients share the same API; Connect implements a subset of the functionality of the sql/core API. Compatibility between the two implementations is enforced by MiMa checks.
While this sort of works for application development, it is not ideal for a number of reasons:
- An application developer needs to pick which API they are going to develop against when setting up their project (they need to select the correct dependencies). While it is true that they can change this later, it does put a mental burden on the developer. A much preferred solution would be to defer binding to an implementation until the code is actually run.
- (Minor) The current setup confuses IDEs and is more of a pain to work with, especially for Spark developers.
- Developing and maintaining the Spark API is more difficult because of the added burden of working with MiMa and/or adding the same API in multiple places.
- Connect testing is fairly anaemic. We have seen a couple of cases where Connect behaves slightly differently, and this could have been detected if Connect were able to leverage Spark SQL's extensive testing.
Goals
- Create a truly shared Scala API with two implementations (see the sketch after this list). The goal is not to replace/simplify/reduce the current sql/core API we all love; the interface will only support the API shared between the implementations. An implementation can provide additional functionality (e.g. RDD-centric methods for the sql/core implementation).
- The common interface should cover all APIs supported by the current Connect Scala client.
- Maintain as much binary compatibility with previous Spark releases as possible.
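A purely illustrative sketch of the intended layering follows; all class, method, and package names are hypothetical stand-ins, not the actual Spark classes or module layout. Application code targets a shared interface, the sql/core and Connect clients each implement it, and the sql/core implementation can still expose extras such as RDD-centric methods.

```scala
object SharedApiSketch {
  // Stand-in types so the sketch is self-contained.
  final case class Column(expr: String)
  final case class Row(values: Seq[Any])

  // Shared interface: only the API surface common to both implementations.
  trait Dataset[T] {
    def filter(condition: Column): Dataset[T]
    def collect(): Seq[T]
  }

  // sql/core ("classic") implementation; it can expose extra API
  // that is not part of the shared interface.
  final class ClassicDataset[T](data: Seq[T]) extends Dataset[T] {
    def filter(condition: Column): Dataset[T] = this  // placeholder: would evaluate locally
    def collect(): Seq[T] = data
    def toLocalIterator: Iterator[T] = data.iterator  // implementation-specific extra
  }

  // Connect implementation: builds up a logical plan that a server would execute.
  final class ConnectDataset[T](plan: String) extends Dataset[T] {
    def filter(condition: Column): Dataset[T] =
      new ConnectDataset[T](s"Filter(${condition.expr}, $plan)")
    def collect(): Seq[T] = Seq.empty  // placeholder: would execute the plan remotely
  }

  // Application code is written against the shared interface only, so binding
  // to an implementation is deferred until the program actually runs.
  def countAdults(ds: Dataset[Row]): Int =
    ds.filter(Column("age >= 18")).collect().size
}
```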
Design Notes
- We are going to try to make the interface very connect-centric. Where possible we will implement functionality using the Connect API (see the sketch after this list).
- .... TBD
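One way to read the connect-centric direction is sketched below, purely as an illustration: the withPlan primitive and the method bodies are hypothetical, not the actual design. Shared methods would be written once against a small plan-building core that each implementation supplies, so both backends inherit the same behaviour.

```scala
object ConnectCentricSketch {
  final case class Column(expr: String)

  trait Dataset[T] {
    // Minimal core every implementation must provide (hypothetical primitive).
    protected def withPlan(op: String): Dataset[T]

    // Shared methods expressed through the core; both the sql/core and the
    // Connect implementation would inherit these definitions unchanged.
    def filter(condition: Column): Dataset[T] = withPlan(s"Filter(${condition.expr})")
    def where(condition: Column): Dataset[T] = filter(condition)
    def limit(n: Int): Dataset[T] = withPlan(s"Limit($n)")
  }
}
```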
Issue Links
- is related to SPARK-44111 Prepare Apache Spark 4.0.0 (Open)