Spark / SPARK-11806 Spark 2.0 deprecations and removals / SPARK-3369

Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator


      Description

      mapPartitions in the Scala RDD API takes a function that transforms an Iterator to an Iterator: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

      In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an Iterator but is required to return an Iterable, which is a stronger condition and appears inconsistent. It's a problematic inconsistency, though, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself.
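To make the memory concern concrete, here is a minimal sketch in plain Java with no Spark dependency. `IterableReturningFunction` is a hypothetical stand-in for the Iterator-in, Iterable-out shape described above, not Spark's actual interface, and `BufferingExample` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the shape described in this issue:
// the function consumes an Iterator but must hand back an Iterable.
interface IterableReturningFunction<T, R> {
    Iterable<R> call(Iterator<T> input);
}

final class BufferingExample {
    // Upper-cases every element of a partition. Because the result has to be
    // an Iterable, the straightforward implementation copies the whole
    // partition into a List before returning.
    static final IterableReturningFunction<String, String> UPPER_CASE = input -> {
        List<String> buffer = new ArrayList<>();
        while (input.hasNext()) {
            buffer.add(input.next().toUpperCase());
        }
        return buffer; // the entire partition is now held in memory
    };
}
```

The List is what makes the result safely re-iterable, and it is also exactly the copy that an Iterator-to-Iterator signature would not need.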

      Similarly for the other mapPartitions* methods and the other *FlatMapFunction types in Java.

      (Is there a reason for this difference that I'm overlooking?)

      If I'm right that this was an inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of:

      Promise that Spark will only call iterator() once, so implementors can use a hacky IteratorIterable that returns the same Iterator.
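A rough sketch of what such an adapter might look like, in plain Java. The class name is taken from the description above, but the implementation is an assumption for illustration only. It is only safe if the caller really does invoke iterator() exactly once.

```java
import java.util.Iterator;

// Hypothetical one-shot adapter: wraps an Iterator so it can be returned
// where an Iterable is required. Every call to iterator() returns the SAME
// underlying Iterator, so a second traversal sees it already exhausted.
final class IteratorIterable<T> implements Iterable<T> {
    private final Iterator<T> iterator;

    IteratorIterable(Iterator<T> iterator) {
        this.iterator = iterator;
    }

    @Override
    public Iterator<T> iterator() {
        return iterator; // deliberately violates the usual Iterable contract
    }
}
```

A stricter variant could throw IllegalStateException on the second call to iterator(), which would surface any violation of the "called once" promise instead of silently yielding nothing.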

      Or, make a series of methods accepting a FlatMapFunction2, etc. with the desired signature, and deprecate existing ones.
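For comparison, a sketch of the desired Iterator-to-Iterator shape in plain Java. `IteratorFlatMapFunction` and `LazyUpperCase` are hypothetical names chosen for this example (not the proposed FlatMapFunction2 itself, whose exact signature is not specified here). The point is that the output can be produced lazily, one element at a time.

```java
import java.util.Iterator;

// Hypothetical interface with the desired signature: Iterator in, Iterator out.
interface IteratorFlatMapFunction<T, R> {
    Iterator<R> call(Iterator<T> input);
}

// Upper-cases each element on demand; nothing is buffered.
final class LazyUpperCase implements IteratorFlatMapFunction<String, String> {
    @Override
    public Iterator<String> call(Iterator<String> input) {
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return input.hasNext();
            }

            @Override
            public String next() {
                // Pulls exactly one element from the input per output element.
                return input.next().toUpperCase();
            }
        };
    }
}
```

With this shape, memory use stays constant regardless of partition size, which matches what the Scala API already allows.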


            People

            • Assignee:
              srowen Sean R. Owen
              Reporter:
              srowen Sean R. Owen
            • Votes:
              3
              Watchers:
              9
