  1. Spark
  2. SPARK-11806 Spark 2.0 deprecations and removals
  3. SPARK-3369

Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator


Details

    Description

      mapPartitions in the Scala RDD API takes a function that transforms an Iterator to an Iterator: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

      In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an Iterator but is required to return an Iterable. That is a stronger condition and appears inconsistent. The inconsistency is problematic because it seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input Iterator does not afford this itself.

      Similarly for the other mapPartitions* methods and other {{*FlatMapFunction}}s in Java.
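
      To make the copying concern concrete, here is a minimal sketch. It does not use Spark itself: {{IterableFlatMap}} is a hypothetical stand-in that only mirrors the Iterator-in, Iterable-out shape described above (without the checked exception), and {{BufferingExample}} and {{UPPER_CASE}} are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the shape of the Java per-partition function:
// it receives an Iterator but must hand back an Iterable.
interface IterableFlatMap<T, R> {
    Iterable<R> call(T input);
}

class BufferingExample {
    // Upper-cases every element of a partition. Because the return type is
    // Iterable, the straightforward implementation drains the whole input
    // Iterator into a List before returning anything.
    static final IterableFlatMap<Iterator<String>, String> UPPER_CASE =
        partition -> {
            List<String> buffered = new ArrayList<String>();
            while (partition.hasNext()) {
                buffered.add(partition.next().toUpperCase());
            }
            return buffered; // the entire partition is now held in memory
        };

    public static void main(String[] args) {
        Iterator<String> partition = Arrays.asList("a", "b", "c").iterator();
        for (String s : UPPER_CASE.call(partition)) {
            System.out.println(s);
        }
    }
}
```

      The List is what makes the result re-iterable, and it is also what forces the whole partition into memory at once.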

      (Is there a reason for this difference that I'm overlooking?)

      If I'm right that this was an inadvertent inconsistency, then the big issue is that this is part of a public API. Workarounds I can think of:

      Promise that Spark will only call iterator() once, so implementors can use a hacky IteratorIterable that returns the same Iterator.

      Or, make a series of methods accepting a FlatMapFunction2, etc. with the desired signature, and deprecate existing ones.
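
      Both workarounds can be sketched in plain Java, without Spark on the classpath. Everything here is an assumption made for illustration: {{IteratorIterable}} is one possible form of the "hacky" wrapper, {{IteratorFlatMapFunction}} is a hypothetical name for an interface with the desired Iterator-returning signature, and {{UpperCaseIterator}} is just a lazy example transformation.

```java
import java.util.Arrays;
import java.util.Iterator;

// Workaround 1 (hypothetical helper): an Iterable that returns the same
// Iterator on every call. Only correct if iterator() is invoked once.
class IteratorIterable<T> implements Iterable<T> {
    private final Iterator<T> iterator;

    IteratorIterable(Iterator<T> iterator) {
        this.iterator = iterator;
    }

    @Override
    public Iterator<T> iterator() {
        return iterator;
    }
}

// Workaround 2 (hypothetical interface): same role as the Iterable-returning
// function, but with an Iterator -> Iterator shape like the Scala API.
interface IteratorFlatMapFunction<T, R> {
    Iterator<R> call(T input);
}

// Lazy transformation: upper-cases elements one at a time as they are pulled,
// so nothing is buffered.
class UpperCaseIterator implements Iterator<String> {
    private final Iterator<String> source;

    UpperCaseIterator(Iterator<String> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public String next() {
        return source.next().toUpperCase();
    }
}

class WorkaroundExample {
    // Workaround 1: satisfy an Iterable return type without copying.
    static Iterable<String> viaIteratorIterable(Iterator<String> partition) {
        return new IteratorIterable<String>(new UpperCaseIterator(partition));
    }

    // Workaround 2: return the lazy Iterator directly.
    static final IteratorFlatMapFunction<Iterator<String>, String> VIA_ITERATOR =
        partition -> new UpperCaseIterator(partition);

    public static void main(String[] args) {
        for (String s : viaIteratorIterable(Arrays.asList("a", "b").iterator())) {
            System.out.println(s);
        }
        Iterator<String> it = VIA_ITERATOR.call(Arrays.asList("c", "d").iterator());
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

      The first variant keeps the existing signature but silently breaks if anything iterates the result twice, because the second pass sees an exhausted Iterator. The second variant avoids that trap at the cost of adding new methods to the public API.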

          People

            srowen Sean R. Owen
            srowen Sean R. Owen
            Votes: 3
            Watchers: 9
