Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Versions: 1.0.2, 1.2.1
Description
mapPartitions in the Scala RDD API takes a function that transforms an Iterator to an Iterator: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an Iterator but is required to return an Iterable, which is a stronger condition and appears inconsistent. It's a problematic inconsistency, too, because it seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input Iterator does not afford this itself.
Similarly for the other mapPartitions* methods and the other {{*FlatMapFunction}}s in Java.
(Is there a reason for this difference that I'm overlooking?)
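To make the inconsistency concrete, here is a minimal sketch (the interface name FlatMapFunctionSketch is illustrative, not the actual Spark interface): because the return type is Iterable rather than Iterator, the obvious safe implementation has to materialize the whole partition in a List before returning.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the current Java API shape:
// the function receives an Iterator but must return an Iterable.
interface FlatMapFunctionSketch<T, R> {
  Iterable<R> call(Iterator<T> input);
}

public class CopyingExample {
  public static void main(String[] args) {
    // The only safe way to satisfy Iterable is to copy the whole
    // partition into memory, defeating the streaming behavior the
    // Scala mapPartitions signature allows.
    FlatMapFunctionSketch<Integer, Integer> doubler = input -> {
      List<Integer> out = new ArrayList<>();
      while (input.hasNext()) {
        out.add(input.next() * 2);
      }
      return out;
    };

    for (int x : doubler.call(List.of(1, 2, 3).iterator())) {
      System.out.println(x);
    }
  }
}
```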
If I'm right that this was inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of:
- Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same Iterator.
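The first workaround might look like the following sketch (the class name IteratorIterable is illustrative): an Iterable wrapper that hands out its one-shot Iterator exactly once, and fails loudly if the single-call promise is violated.

```java
import java.util.Iterator;
import java.util.List;

// Hacky adapter: wraps a one-shot Iterator as an Iterable.
// Only valid under the promise that iterator() is called at most once.
class IteratorIterable<T> implements Iterable<T> {
  private final Iterator<T> iterator;
  private boolean consumed = false;

  IteratorIterable(Iterator<T> iterator) {
    this.iterator = iterator;
  }

  @Override
  public Iterator<T> iterator() {
    if (consumed) {
      throw new IllegalStateException("iterator() may only be called once");
    }
    consumed = true;
    return iterator;
  }
}

public class IteratorIterableDemo {
  public static void main(String[] args) {
    Iterable<String> once = new IteratorIterable<>(List.of("a", "b").iterator());
    for (String s : once) {
      System.out.println(s);
    }
  }
}
```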
- Or, add a series of methods accepting a {{FlatMapFunction2}}, etc., with the desired signature, and deprecate the existing ones.
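The second workaround amounts to introducing a function interface whose signature matches the Scala API: Iterator in, Iterator out. A sketch (the name FlatMapFunctionV2 is hypothetical) shows that an implementation can then transform elements lazily, without materializing the partition:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical replacement interface whose signature matches the
// Scala mapPartitions contract: Iterator in, Iterator out.
interface FlatMapFunctionV2<T, R> {
  Iterator<R> call(Iterator<T> input) throws Exception;
}

public class IteratorSignatureDemo {
  public static void main(String[] args) throws Exception {
    // Lazily maps each element; nothing is buffered in memory.
    FlatMapFunctionV2<String, Integer> lengths = input -> new Iterator<Integer>() {
      @Override public boolean hasNext() { return input.hasNext(); }
      @Override public Integer next() { return input.next().length(); }
    };

    Iterator<Integer> out = lengths.call(List.of("ab", "cde").iterator());
    while (out.hasNext()) {
      System.out.println(out.next());
    }
  }
}
```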