Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-19287

JavaPairRDD flatMapValues requires function returning Iterable, not Iterator

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 3.0.0
    • Component/s: Java API
    • Labels:
    • Target Version/s:
    • Docs Text:
      Hide
      JavaPairRDD.flatMapValues() now requires a FlatMapFunction as an argument. This means this function now must return an Iterator, not Iterable. This corrects a long-standing inconsistency between the Scala and Java API, and allows the caller to supply merely an Iterator, not a full Iterable. Existing functions passed to this method can simply invoke ".iterator()" on their existing return value to comply with the new signature.
      Show
      JavaPairRDD.flatMapValues() now requires a FlatMapFunction as an argument. This means this function now must return an Iterator, not Iterable. This corrects a long-standing inconsistency between the Scala and Java API, and allows the caller to supply merely an Iterator, not a full Iterable. Existing functions passed to this method can simply invoke ".iterator()" on their existing return value to comply with the new signature.

      Description

      SPARK-3369 corrected an old oversight in the Java API, wherein FlatMapFunction required an Iterable rather than Iterator. As reported by Asher Krim, it seems that this same type of problem was overlooked also in JavaPairRDD (https://github.com/apache/spark/blob/6c00c069e3c3f5904abd122cea1d56683031cca0/core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala#L677 ):

      def flatMapValues[U](f: JFunction[V, java.lang.Iterable[U]]): JavaPairRDD[K, U] =
      

      As in PairRDDFunctions.scala, whose flatMapValues operates on TraversableOnce, this should really take a function that returns an Iterator – really, FlatMapFunction.

      We can easily add an overload and deprecate the existing method.

      def flatMapValues[U](f: FlatMapFunction[V, U]): JavaPairRDD[K, U]
      

      This is source- and binary-backwards-compatible, in Java 7. It's binary-backwards-compatible in Java 8, but not source-compatible. The following natural usage with Java 8 lambdas becomes ambiguous and won't compile – Java won't figure out which to implement even based on the return type unfortunately:

      JavaPairRDD<Integer, String> pairRDD = ...
      JavaPairRDD<Integer, Integer> mappedRDD = 
        pairRDD.flatMapValues(s -> Arrays.asList(s.length()).iterator());
      

      It can be resolved by explicitly casting the lambda.

      We can at least document this. One day in Spark 3.x this can just be changed outright.

      It's conceivable to resolve this by making the new method called "flatMapValues2" or something ugly.

        Attachments

          Activity

            People

            • Assignee:
              srowen Sean R. Owen
              Reporter:
              srowen Sean R. Owen
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: