SPARK-45170: Scala-specific improvements in Dataset[T] API



    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: Spark Core


      Q1. What are you trying to do? 

      The main idea is to use Scala macros to give developers a more convenient and type-safe API for writing join conditions.


      Q2. What problem is this proposal NOT designed to solve?

      The R, Java, Python, and DataFrame APIs are out of scope. The solution does not affect plan generation either.


      Q3. How is it done today, and what are the limits of current practice?

      Currently a join condition is specified via strings, which can lead to silly mistakes (typos, incompatible column types, etc.) and is sometimes hard to read (for example, when several joins are chained and the resulting type is a tuple of tuples of tuples...).
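
      For reference, a minimal sketch of the string-based style using the existing Dataset.joinWith API. The case classes, the object name StringJoins, and the sample rows are hypothetical and serve only as illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row types, used only for illustration.
case class User(id: Long, name: String)
case class Order(orderId: Long, userId: Long)
case class Payment(orderId: Long, amount: Double)

object StringJoins {
  def run(spark: SparkSession): Array[((User, Order), Payment)] = {
    import spark.implicits._

    val users    = Seq(User(1L, "ann")).toDS()
    val orders   = Seq(Order(10L, 1L)).toDS()
    val payments = Seq(Payment(10L, 9.5)).toDS()

    // Column names are plain strings: a typo such as "usrId" still compiles
    // and is rejected only when Spark analyzes the query.
    val userOrders: Dataset[(User, Order)] =
      users.joinWith(orders, users("id") === orders("userId"))

    // After the first join the left side is a tuple, so the next
    // condition has to use a path such as "_2.orderId".
    val all: Dataset[((User, Order), Payment)] =
      userOrders.joinWith(payments, userOrders("_2.orderId") === payments("orderId"))

    all.collect()
  }
}
```

      Each further join adds another level of tuple nesting, so the string paths get longer and easier to mistype.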


      Q4. What is new in your approach and why do you think it will be successful?

      Scala macros can be used to extract the column name directly from a lambda (an extractor). As a side effect, it becomes possible to check the column type and to prevent building an inconsistent join expression (such as a boolean-timestamp comparison).
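
      The type-checking half of this idea can be sketched in plain Scala without any macro. Everything below (TypedCol, JoinCond, col, and the case classes) is hypothetical and is not the API of the library mentioned in this proposal; in particular, the column name is passed by hand here, whereas the proposal would have a macro derive it from the lambda.

```scala
import java.sql.Timestamp

// Hypothetical names throughout; a sketch, not the proposed API.
// A column reference that remembers its row type T and field type A.
final case class TypedCol[T, A](name: String) {
  // The comparison only compiles when both sides have the same field type A.
  def ===[U](other: TypedCol[U, A]): JoinCond[T, U] = JoinCond(name, other.name)
}

// A join condition between row types T and U, kept here as two column names.
final case class JoinCond[T, U](left: String, right: String)

final case class Account(id: Long, active: Boolean)
final case class Session(accountId: Long, startedAt: Timestamp)

object TypedJoinSketch {
  // The extractor is used only to infer T and A. A macro would also
  // read the field name from it; this sketch takes the name explicitly.
  def col[T, A](name: String, extractor: T => A): TypedCol[T, A] = TypedCol[T, A](name)

  val accountId        = col("id", (a: Account) => a.id)
  val accountActive    = col("active", (a: Account) => a.active)
  val sessionAccountId = col("accountId", (s: Session) => s.accountId)
  val sessionStartedAt = col("startedAt", (s: Session) => s.startedAt)

  // Long compared with Long: accepted.
  val cond: JoinCond[Account, Session] = accountId === sessionAccountId

  // Boolean compared with Timestamp: rejected by the compiler.
  // val bad = accountActive === sessionStartedAt
}
```

      Under these assumptions a mismatched comparison is a compile error rather than a runtime analysis failure.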


      Q5. Who cares? If you are successful, what difference will it make?

      Mainly Scala developers who prefer type-safe code: they would get a cleaner, nicer API that makes the codebase a bit clearer, especially when several chained joins are used.


      Q6. What are the risks?

      Overuse of macros may slow down compilation. In addition, macros are hard to maintain.


      Q7. How long will it take?

      The approach is already implemented as a separate library that does a bit more than provide an alternative API (for example, it abstracts Dataset[T] to F[T], which allows some Spark-specific code to run without a Spark session for testing purposes).

      Adapting it won't be hard; it is a matter of several weeks.
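
      The idea of abstracting Dataset[T] to F[T] might look roughly like the following type-class sketch. The names DataOps, listOps, and Pipeline are hypothetical and not taken from the library; a Dataset-backed instance would also need Encoder evidence for map, which this sketch leaves out.

```scala
// Hypothetical type class; names are illustrative only.
trait DataOps[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
  def filter[A](fa: F[A])(p: A => Boolean): F[A]
}

object DataOps {
  // In-memory instance for unit tests: no SparkSession is required.
  implicit val listOps: DataOps[List] = new DataOps[List] {
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
    def filter[A](fa: List[A])(p: A => Boolean): List[A] = fa.filter(p)
  }
}

object Pipeline {
  // Logic written once against F; tests run it on List, while
  // production code would supply a Dataset-backed instance.
  def evenDoubled[F[_]](in: F[Int])(implicit ops: DataOps[F]): F[Int] =
    ops.map(ops.filter(in)(_ % 2 == 0))(_ * 2)
}
```

      With the List instance, Pipeline.evenDoubled(List(1, 2, 3, 4)) can be checked in an ordinary unit test.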


      Q8. What are the mid-term and final “exams” to check for success?

      API convenience is very hard to measure, as it is more or less a matter of taste.


      Appendix A

      You may find examples of such a 'cleaner' API here

      Note that backward and forward compatibility is achieved by introducing a brand-new API without modifying the old one.





            Assignee: Unassigned
            Reporter: Danila Goloshchapov (salamahin)