[SPARK-13378] Add tee method to RDD - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.6.0
Fix Version/s: None
Component/s: Spark Core
Labels:
None

Description

In our application, we sometimes need to save the partial/intermediate results to side files in the middle of a data pipeline/DAG. The only way now to do this is to use saveAsTextFile method which only runs at the end of a pipeline. Otherwise multiple jobs are needed. We’ve implemented ‘tee’ method on RDD that is similar to Unix tee utility. Below are the proposed methods:

def tee(path: String) : RDD[T]

Return a new RDD that is the same as this RDD but also save a copy of this RDD to a text file, using string representation of elements.

def tee(path: String, f: (T) => Boolean): RDD[T]

Return a new RDD that is the same as this RDD but also save to a text file a copy of the elements in this RDD that satisfy a predicate , using string representation of elements.

These methods can be used in RDD pipelines in ways similar to the tee utility in Unix command pipeline, for example,

sc.textFile(dataFile).map(x => x.split(“\t”)
    .map(x => (x(0), x(1).toInt, x(2))
    .tee(“output/tee-data-1”)
    .tee(“output/tee-data-2”, x=> x._2 >= 10)
    .groupBy(x => x._1)
    .saveAsTextFile(“output/out-data”)

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Richard Ding

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 19/Feb/16 00:18

Updated:: 19/Feb/16 09:53

Resolved:: 19/Feb/16 09:53