Details
- Type: Documentation
- Priority: Minor
- Status: Resolved
- Resolution: Duplicate
Description
Spark documentation should be more precise about the algebraic properties of functions in various transformations. The way the current documentation is written is potentially confusing. For example, in Spark 1.6, the scaladoc for reduce in RDD says:
> Reduces the elements of this RDD using the specified commutative and associative binary operator.
This is precise and accurate. In the documentation of reduceByKey in PairRDDFunctions, on the other hand, it says:
> Merge the values for each key using an associative reduce function.
To be precise, this function must also be commutative for the computation to be correct. Mentioning commutativity for reduce but not for reduceByKey gives the false impression that the latter's function need not be commutative.
The same applies to aggregateByKey. To be precise, both seqOp and combOp need to be associative (mentioned) AND commutative (not mentioned) in order for the computation to be correct. It would be desirable to fix these inconsistencies throughout the documentation.
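A minimal sketch (plain Python, not Spark; `merge_partitions` is a hypothetical helper, not a Spark API) of why commutativity matters here: string concatenation is associative but not commutative, so combining per-partition partial results in a different arrival order changes the answer, even though each partition was reduced correctly.

```python
from functools import reduce

def merge_partitions(partitions, op):
    # Mimic reduceByKey for a single key: reduce each partition
    # locally, then combine the partial results in whatever order
    # the partitions happen to finish.
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)

concat = lambda a, b: a + b  # associative, but NOT commutative

# Same values for the key, but partition results merge in a
# different order -- the final result differs.
print(merge_partitions([["a", "b"], ["c"]], concat))  # abc
print(merge_partitions([["c"], ["a", "b"]], concat))  # cab
```

An associative function guarantees that grouping does not matter, but only commutativity guarantees that the order in which partial results are merged does not matter either; a correct reduceByKey operator needs both.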
Issue Links
- duplicates SPARK-13339: Clarify commutative / associative operator requirements for reduce, fold (Resolved)