Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-13378

Add tee method to RDD

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Not A Problem
    • 1.6.0
    • None
    • Spark Core
    • None

    Description

      In our application, we sometimes need to save the partial/intermediate results to side files in the middle of a data pipeline/DAG. The only way now to do this is to use saveAsTextFile method which only runs at the end of a pipeline. Otherwise multiple jobs are needed. We’ve implemented ‘tee’ method on RDD that is similar to Unix tee utility. Below are the proposed methods:

      def tee(path: String) : RDD[T]
      

      Return a new RDD that is the same as this RDD but also save a copy of this RDD to a text file, using string representation of elements.

      def tee(path: String, f: (T) => Boolean): RDD[T]
      

      Return a new RDD that is the same as this RDD but also save to a text file a copy of the elements in this RDD that satisfy a predicate , using string representation of elements.

      These methods can be used in RDD pipelines in ways similar to the tee utility in Unix command pipeline, for example,

      sc.textFile(dataFile).map(x => x.split(“\t”)
          .map(x => (x(0), x(1).toInt, x(2))
          .tee(“output/tee-data-1”)
          .tee(“output/tee-data-2”, x=> x._2 >= 10)
          .groupBy(x => x._1)
          .saveAsTextFile(“output/out-data”)
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            rding Richard Ding
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: