Details
- Type: Documentation
- Priority: Minor
- Status: Resolved
- Resolution: Duplicate
Description
Spark documentation should be more precise about the algebraic properties of functions in various transformations. The way the current documentation is written is potentially confusing. For example, in Spark 1.6, the scaladoc for reduce in RDD says:
> Reduces the elements of this RDD using the specified commutative and associative binary operator.
This is precise and accurate. In the documentation of reduceByKey in PairRDDFunctions, on the other hand, it says:
> Merge the values for each key using an associative reduce function.
To be precise, this function must also be commutative for the computation to be correct. Mentioning commutativity for reduce but not for reduceByKey gives the false impression that the latter's function need not be commutative.
The same applies to aggregateByKey. To be precise, both seqOp and combOp need to be associative (mentioned) AND commutative (not mentioned) in order for the computation to be correct. It would be desirable to fix these inconsistencies throughout the documentation.
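A minimal sketch (plain Python, not Spark; `merge_partitions` is a hypothetical helper, not a Spark API) of why commutativity matters here: string concatenation is associative but not commutative, so combining per-partition partial results in a different arrival order changes the answer, even though each partition was reduced correctly.

```python
from functools import reduce

def merge_partitions(partitions, op):
    # Mimic reduceByKey for a single key: reduce each partition
    # locally, then combine the partial results in whatever order
    # the partitions happen to finish.
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)

concat = lambda a, b: a + b  # associative, but NOT commutative

# Same values for the key, but partition results merge in a
# different order -- the final result differs.
print(merge_partitions([["a", "b"], ["c"]], concat))  # abc
print(merge_partitions([["c"], ["a", "b"]], concat))  # cab
```

An associative function guarantees that grouping does not matter, but only commutativity guarantees that the order in which partial results are merged does not matter either; a correct reduceByKey operator needs both.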
Issue Links
- duplicates SPARK-13339: Clarify commutative / associative operator requirements for reduce, fold (Resolved)