Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 2.1.1
Description
SPARK-3369 corrected an old oversight in the Java API, wherein FlatMapFunction required an Iterable rather than an Iterator. As reported by akrim, the same problem also exists in JavaPairRDD (https://github.com/apache/spark/blob/6c00c069e3c3f5904abd122cea1d56683031cca0/core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala#L677):
def flatMapValues[U](f: JFunction[V, java.lang.Iterable[U]]): JavaPairRDD[K, U] =
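With this signature, a Java caller has to return a materialized collection even when an iterator would do. A minimal sketch of current usage, assuming pairRDD is a JavaPairRDD<Integer, String> and these imports (shared by the later sketches):

import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

// Current API: the lambda must return an Iterable, here a List.
JavaPairRDD<Integer, Integer> lengths =
    pairRDD.flatMapValues(s -> Arrays.asList(s.length()));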
As in PairRDDFunctions.scala, whose flatMapValues operates on TraversableOnce, this should really take a function that returns an Iterator; that is, a FlatMapFunction.
We can easily add an overload and deprecate the existing method:
def flatMapValues[U](f: FlatMapFunction[V, U]): JavaPairRDD[K, U]
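In Java 7, callers implement the new overload with an anonymous class, whose concrete type makes overload resolution unambiguous; a sketch with the same assumed imports and pairRDD as above:

JavaPairRDD<Integer, Integer> lengths = pairRDD.flatMapValues(
    new FlatMapFunction<String, Integer>() {
      @Override
      public Iterator<Integer> call(String s) {
        // Return an Iterator directly; no intermediate Iterable required.
        return Arrays.asList(s.length()).iterator();
      }
    });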
This is source- and binary-backwards-compatible in Java 7. In Java 8 it is binary-backwards-compatible but not source-compatible: the following natural usage with lambdas becomes ambiguous and won't compile, because Java unfortunately can't pick an overload even based on the lambda's return type:
JavaPairRDD<Integer, String> pairRDD = ...
JavaPairRDD<Integer, Integer> mappedRDD =
    pairRDD.flatMapValues(s -> Arrays.asList(s.length()).iterator());
It can be resolved by explicitly casting the lambda.
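A sketch of that workaround, with the same assumed imports and pairRDD:

JavaPairRDD<Integer, Integer> mappedRDD = pairRDD.flatMapValues(
    // The cast pins the lambda to the new Iterator-returning overload.
    (FlatMapFunction<String, Integer>) s -> Arrays.asList(s.length()).iterator());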
We can at least document this. One day in Spark 3.x this can just be changed outright.
It's conceivable to sidestep the ambiguity by naming the new method something like "flatMapValues2", but that's ugly.