[SPARK-45170] Scala-specific improvements in Dataset[T] API - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.4.1
Fix Version/s: None
Component/s: Spark Core
Labels:
- SPIP

Language:
- Scala

Description

Q1. What are you trying to do?

The main idea is to use Scala macros to give developers a more convenient and type-safe API for join conditions.

Q2. What problem is this proposal NOT designed to solve?

The R, Java, Python, and DataFrame APIs are out of scope. The solution does not affect plan generation either.

Q3. How is it done today, and what are the limits of current practice?

Currently the join condition is specified via strings, which can lead to silly mistakes (typos, incompatible column types, etc.) and is sometimes hard to read (for example, when several joins are chained and the final type is a tuple of tuples of tuples).
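To make the limitation concrete, here is a minimal sketch of a string-based join with the existing Dataset API. The `User` and `Order` case classes, the object name, and the sample data are illustrative assumptions and do not come from this ticket; only `Dataset.apply`, `Column.===`, and `joinWith` are existing Spark calls.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative record types; they are not taken from the ticket.
final case class User(id: Long, name: String)
final case class Order(orderId: Long, userId: Long, amount: Double)

object StringJoinExample {
  def joinUsersWithOrders(users: Dataset[User], orders: Dataset[Order]): Dataset[(User, Order)] =
    // Columns are looked up by string name. A typo such as "usrId" compiles
    // and only fails with an AnalysisException when this line runs.
    users.joinWith(orders, users("id") === orders("userId"), "inner")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("string-join").getOrCreate()
    import spark.implicits._

    val users = Seq(User(1L, "ann"), User(2L, "bob")).toDS()
    val orders = Seq(Order(10L, 1L, 9.5)).toDS()

    val joined = joinUsersWithOrders(users, orders)
    // Chaining another joinWith nests the element type, for example
    // Dataset[((User, Order), User)], and fields are then addressed
    // through paths such as "_1._1.id".
    joined.show()
    spark.stop()
  }
}
```

Nothing in the signature of `joinWith` ties the condition to the fields of `User` or `Order`, which is the gap this proposal targets.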

Q4. What is new in your approach and why do you think it will be successful?

Scala macros can be used to extract the column name directly from a lambda (an extractor). As a side effect, it becomes possible to check the column types and to reject inconsistent join expressions at compile time (such as a boolean-timestamp comparison).
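The type-checking side effect can be illustrated without macros. The sketch below is not the proposed implementation: it replaces the macro-based extractor with hand-written column references, and `TypedCol`, `JoinCond`, `Cols`, and the `===` method are hypothetical names that exist neither in Spark nor, as far as this ticket states, in the library it mentions. It only shows how carrying the column type in a type parameter lets the compiler reject a boolean-timestamp comparison.

```scala
import java.sql.Timestamp

// Hypothetical typed column reference: T is the record type, A the column type.
// In the proposal the name would be derived from a lambda by a macro;
// here it is written by hand.
final case class TypedCol[T, A](name: String) {
  // Both sides must share the column type A, otherwise this does not compile.
  def ===[R](other: TypedCol[R, A]): JoinCond[T, R] = JoinCond(name, other.name)
}

// Hypothetical join condition between a left record L and a right record R.
final case class JoinCond[L, R](leftColumn: String, rightColumn: String)

// Illustrative record types; they are not taken from the ticket.
final case class User(id: Long, active: Boolean)
final case class Order(userId: Long, createdAt: Timestamp)

object Cols {
  val userId: TypedCol[User, Long] = TypedCol("id")
  val userActive: TypedCol[User, Boolean] = TypedCol("active")
  val orderUserId: TypedCol[Order, Long] = TypedCol("userId")
  val orderCreatedAt: TypedCol[Order, Timestamp] = TypedCol("createdAt")
}

object TypedJoinExample {
  // Long compared with Long: accepted.
  val cond: JoinCond[User, Order] = Cols.userId === Cols.orderUserId

  // Boolean compared with Timestamp: rejected by the compiler if uncommented.
  // val bad = Cols.userActive === Cols.orderCreatedAt
}
```

A macro would remove the need to spell out `Cols` by hand, at the cost of the compilation and maintenance risks listed under Q6.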

Q5. Who cares? If you are successful, what difference will it make?

Mainly Scala developers who prefer type-safe code. They would get a cleaner API that makes the codebase a bit clearer, especially when several chained joins are used.

Q6. What are the risks?

Overuse of macros may slow down compilation. In addition, macros are hard to maintain.

Q7. How long will it take?

The approach is already implemented as a separate library that does a bit more than provide an alternative API (for example, it abstracts Dataset[T] to F[T], which allows some Spark-specific code to run without a Spark session for testing purposes).

Adapting it should not be hard; it is a matter of several weeks.
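The idea of abstracting Dataset[T] to F[T] can be sketched as a small type class. This is an assumption about the general technique, not a description of the library's actual design: `DataOps`, `joinOn`, `Pipeline`, and the record types are hypothetical names. The sketch provides only an in-memory `List` instance, which is what lets the logic run in a unit test without a Spark session; a Dataset-backed instance is not shown.

```scala
// Hypothetical type class: the operations the pipeline needs from a container F[_].
trait DataOps[F[_]] {
  def filter[A](fa: F[A])(p: A => Boolean): F[A]
  def joinOn[A, B, K](left: F[A], right: F[B])(lk: A => K, rk: B => K): F[(A, B)]
}

object DataOps {
  // In-memory instance for tests; no SparkSession is required.
  implicit val listOps: DataOps[List] = new DataOps[List] {
    def filter[A](fa: List[A])(p: A => Boolean): List[A] = fa.filter(p)

    def joinOn[A, B, K](left: List[A], right: List[B])(lk: A => K, rk: B => K): List[(A, B)] =
      for { a <- left; b <- right if lk(a) == rk(b) } yield (a, b)
  }
}

// Illustrative record types; they are not taken from the ticket.
final case class User(id: Long, name: String)
final case class Order(orderId: Long, userId: Long)

object Pipeline {
  // Logic written once against F[_]; the caller chooses the concrete container.
  def ordersOfUser[F[_]](users: F[User], orders: F[Order], name: String)(
      implicit ops: DataOps[F]): F[(User, Order)] =
    ops.joinOn(ops.filter(users)(_.name == name), orders)(_.id, _.userId)
}
```

In a test, `Pipeline.ordersOfUser(List(User(1L, "ann")), List(Order(10L, 1L)), "ann")` runs entirely in memory.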

Q8. What are the mid-term and final “exams” to check for success?

API convenience is very hard to measure, as it is more or less a matter of taste.

Appendix A

You may find examples of such a 'cleaner' API here

Note that backward and forward compatibility is achieved by introducing a brand-new API without modifying the old one.

People

Assignee: Unassigned

Reporter: Danila Goloshchapov

Votes: 4

Watchers: 4

Dates

Created: 14/Sep/23 15:13

Updated: 28/Oct/23 16:00