Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Not A Problem
-
1.6.0
-
None
-
None
Description
In our application, we sometimes need to save the partial/intermediate results to side files in the middle of a data pipeline/DAG. The only way now to do this is to use saveAsTextFile method which only runs at the end of a pipeline. Otherwise multiple jobs are needed. We’ve implemented ‘tee’ method on RDD that is similar to Unix tee utility. Below are the proposed methods:
def tee(path: String) : RDD[T]
Return a new RDD that is the same as this RDD but also save a copy of this RDD to a text file, using string representation of elements.
def tee(path: String, f: (T) => Boolean): RDD[T]
Return a new RDD that is the same as this RDD but also save to a text file a copy of the elements in this RDD that satisfy a predicate , using string representation of elements.
These methods can be used in RDD pipelines in ways similar to the tee utility in Unix command pipeline, for example,
sc.textFile(dataFile).map(x => x.split(“\t”) .map(x => (x(0), x(1).toInt, x(2)) .tee(“output/tee-data-1”) .tee(“output/tee-data-2”, x=> x._2 >= 10) .groupBy(x => x._1) .saveAsTextFile(“output/out-data”)