Details
- Type: Improvement
- Priority: Major
- Status: Resolved
- Resolution: Incomplete
Description
While working on the group apply function for Python [1], we found it easier to depart from SparkR's gapply function in the following ways:
- the keys are appended by default to the Spark DataFrame being returned
- the output schema that the user provides is the schema of the R data frame and does not include the keys
Here are the reasons for doing so:
- in most cases, users will want to know the key associated with a result, so appending the key is the sensible default
- most functions in the SQL interface and in MLlib append columns rather than drop them; the current gapply departs from this philosophy
- for the cases when users do not need the key, adding it costs only a small fraction of the computation time and of the output size
- from a formal perspective, it makes calling gapply fully transparent to the type of the key: the user function is easier to write because it does not need to know anything about the key
This ticket proposes to change SparkR's gapply function to follow the same convention as Python's implementation.
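For concreteness, here is a minimal sketch of the proposed convention in plain Python (pandas only, no Spark dependency). The helper name group_apply, its signature, and the example columns are hypothetical and are not the spark-sklearn or SparkR API; the sketch only illustrates that the key columns are added to the result automatically and that the user-provided output schema describes just the function's output.

```python
# Illustrative only: a hypothetical group_apply helper (not the spark-sklearn
# or SparkR API) showing the proposed convention with plain pandas.
import pandas as pd

def group_apply(df, keys, func, output_columns):
    """Apply func to each group of df; add the grouping key columns to the result."""
    pieces = []
    for key_values, group in df.groupby(keys):
        if not isinstance(key_values, tuple):
            key_values = (key_values,)           # older pandas yields a scalar key
        out = func(group)                        # the user function never sees the keys
        out.columns = output_columns             # user schema covers only the function output
        for i, (name, value) in enumerate(zip(keys, key_values)):
            out.insert(i, name, value)           # key columns added back automatically
        pieces.append(out)
    return pd.concat(pieces, ignore_index=True)

# Usage: a per-group mean; the lambda and the output schema ["m"] know nothing
# about the key column "id", yet "id" appears in the returned data frame.
df = pd.DataFrame({"id": [1, 1, 2], "x": [1.0, 3.0, 5.0]})
print(group_apply(df, ["id"], lambda g: pd.DataFrame({"m": [g["x"].mean()]}), ["m"]))
```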
[1] https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py