Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Versions: 1.0.2, 1.2.1
Description
mapPartitions in the Scala RDD API takes a function that transforms an Iterator to an Iterator: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an Iterator but is required to return an Iterable, which is a stronger condition and appears inconsistent. It's a problematic inconsistency, too, because it seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input Iterator does not afford this itself.
Similarly for the other mapPartitions* methods and the other {{*FlatMapFunction}}s in Java.
(Is there a reason for this difference that I'm overlooking?)
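To make the inconsistency concrete, here is a minimal sketch (the interface name FlatMapFunctionSketch is illustrative, not the actual Spark interface): because the return type is Iterable rather than Iterator, the obvious safe implementation has to materialize the whole partition in a List before returning.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the current Java API shape:
// the function receives an Iterator but must return an Iterable.
interface FlatMapFunctionSketch<T, R> {
  Iterable<R> call(Iterator<T> input);
}

public class CopyingExample {
  public static void main(String[] args) {
    // The only safe way to satisfy Iterable is to copy the whole
    // partition into memory, defeating the streaming behavior the
    // Scala mapPartitions signature allows.
    FlatMapFunctionSketch<Integer, Integer> doubler = input -> {
      List<Integer> out = new ArrayList<>();
      while (input.hasNext()) {
        out.add(input.next() * 2);
      }
      return out;
    };

    for (int x : doubler.call(List.of(1, 2, 3).iterator())) {
      System.out.println(x);
    }
  }
}
```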
If I'm right that this was inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of:
- Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same Iterator.
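The first workaround might look like the following sketch (the class name IteratorIterable is illustrative): an Iterable wrapper that hands out its one-shot Iterator exactly once, and fails loudly if the single-call promise is violated.

```java
import java.util.Iterator;
import java.util.List;

// Hacky adapter: wraps a one-shot Iterator as an Iterable.
// Only valid under the promise that iterator() is called at most once.
class IteratorIterable<T> implements Iterable<T> {
  private final Iterator<T> iterator;
  private boolean consumed = false;

  IteratorIterable(Iterator<T> iterator) {
    this.iterator = iterator;
  }

  @Override
  public Iterator<T> iterator() {
    if (consumed) {
      throw new IllegalStateException("iterator() may only be called once");
    }
    consumed = true;
    return iterator;
  }
}

public class IteratorIterableDemo {
  public static void main(String[] args) {
    Iterable<String> once = new IteratorIterable<>(List.of("a", "b").iterator());
    for (String s : once) {
      System.out.println(s);
    }
  }
}
```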
- Or, add a series of methods accepting a {{FlatMapFunction2}}, etc., with the desired signature, and deprecate the existing ones.
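The second workaround amounts to introducing a function interface whose signature matches the Scala API: Iterator in, Iterator out. A sketch (the name FlatMapFunctionV2 is hypothetical) shows that an implementation can then transform elements lazily, without materializing the partition:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical replacement interface whose signature matches the
// Scala mapPartitions contract: Iterator in, Iterator out.
interface FlatMapFunctionV2<T, R> {
  Iterator<R> call(Iterator<T> input) throws Exception;
}

public class IteratorSignatureDemo {
  public static void main(String[] args) throws Exception {
    // Lazily maps each element; nothing is buffered in memory.
    FlatMapFunctionV2<String, Integer> lengths = input -> new Iterator<Integer>() {
      @Override public boolean hasNext() { return input.hasNext(); }
      @Override public Integer next() { return input.next().length(); }
    };

    Iterator<Integer> out = lengths.call(List.of("ab", "cde").iterator());
    while (out.hasNext()) {
      System.out.println(out.next());
    }
  }
}
```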