SPARK-16258: Automatically append the grouping keys in SparkR's gapply


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Component/s: SparkR

    Description

      While working on the group apply function for Python [1], we found it preferable to depart from SparkR's gapply function in the following ways:

      • the grouping keys are appended by default to the Spark DataFrame being returned
      • the output schema that the user provides is the schema of the R data frame and does not include the keys

      Here are the reasons for doing so:

      • in most cases, users will want to know the key associated with a result, so appending the key is the sensible default
      • most functions in the SQL interface and in MLlib append columns, and gapply departs from this philosophy
      • when users do not need the key, appending it adds only a fraction of the computation time and of the output size
      • from a formal perspective, it makes calls through gapply fully transparent to the type of the key: it is easier to build a function with gapply because it does not need to know anything about the key

      This ticket proposes to change SparkR's gapply function to follow the same convention as Python's implementation.
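      The convention can be sketched in plain Python (a hypothetical `group_apply` helper for illustration only; this is not the actual SparkR or spark-sklearn code):

```python
# Sketch of the proposed convention: the user's function returns rows
# WITHOUT the grouping key, and the framework appends the key to every
# output row automatically. All names here are hypothetical.

def group_apply(rows, key_fn, fn):
    """Group `rows` by key_fn, apply `fn` to each group, and prepend
    the grouping key to every row that `fn` returns."""
    groups = {}
    for row in rows:
        groups.setdefault(key_fn(row), []).append(row)
    out = []
    for key, group in sorted(groups.items()):
        # `fn` sees only the group's rows; it knows nothing about the key,
        # and its declared output schema does not include a key column.
        for result_row in fn(group):
            out.append((key,) + result_row)  # key appended by the framework
    return out

# Per-key mean of the value column; fn's result has a single (mean) column.
data = [("a", 1), ("a", 3), ("b", 10)]
result = group_apply(data,
                     key_fn=lambda r: r[0],
                     fn=lambda g: [(sum(v for _, v in g) / len(g),)])
# result == [("a", 2.0), ("b", 10.0)]
```

      Note how the user-supplied function is fully independent of the key's type and arity, which is the transparency argument made above.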

      cc Narine shivaram

      [1] https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py

          People

            Assignee: Unassigned
            Reporter: timhunter (Timothy Hunter)
            Votes: 0
            Watchers: 7
