Details
- Type: Improvement
- Priority: Major
- Status: Resolved
- Resolution: Incomplete
Description
While working on the group apply function for Python [1], we found it easier to depart from SparkR's gapply function in the following ways:
- the keys are appended by default to the Spark DataFrame being returned
- the output schema that the user provides is the schema of the R data frame and does not include the keys
Here are the reasons for doing so:
- in most cases, users will want to know the key associated with a result, so appending the key is the sensible default
- most functions in the SQL interface and in MLlib append columns rather than drop them; the current gapply departs from this philosophy
- for the cases when users do not need the key, adding it costs only a small fraction of the computation time and of the output size
- from a formal perspective, it makes calling gapply fully transparent to the type of the key: the user function is easier to write because it does not need to know anything about the key
This ticket proposes to change SparkR's gapply function to follow the same convention as Python's implementation.
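For concreteness, here is a minimal sketch of the proposed convention in plain Python (pandas only, no Spark dependency). The helper name group_apply, its signature, and the example columns are hypothetical and are not the spark-sklearn or SparkR API; the sketch only illustrates that the key columns are added to the result automatically and that the user-provided output schema describes just the function's output.

```python
# Illustrative only: a hypothetical group_apply helper (not the spark-sklearn
# or SparkR API) showing the proposed convention with plain pandas.
import pandas as pd

def group_apply(df, keys, func, output_columns):
    """Apply func to each group of df; add the grouping key columns to the result."""
    pieces = []
    for key_values, group in df.groupby(keys):
        if not isinstance(key_values, tuple):
            key_values = (key_values,)           # older pandas yields a scalar key
        out = func(group)                        # the user function never sees the keys
        out.columns = output_columns             # user schema covers only the function output
        for i, (name, value) in enumerate(zip(keys, key_values)):
            out.insert(i, name, value)           # key columns added back automatically
        pieces.append(out)
    return pd.concat(pieces, ignore_index=True)

# Usage: a per-group mean; the lambda and the output schema ["m"] know nothing
# about the key column "id", yet "id" appears in the returned data frame.
df = pd.DataFrame({"id": [1, 1, 2], "x": [1.0, 3.0, 5.0]})
print(group_apply(df, ["id"], lambda g: pd.DataFrame({"m": [g["x"].mean()]}), ["m"]))
```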
[1] https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py