[SPARK-20106] Nonlazy caching of DataFrame after orderBy/sort - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 2.0.1, 2.1.0
Fix Version/s: None
Component/s: PySpark, SQL
Labels: None

Description

Calling cache or persist on a DataFrame after orderBy or sort is not lazy; the call immediately triggers a Spark job:

spark.range(1, 1000).orderBy("id").cache()

Other operations do not generate a job when cached:

from pyspark.sql import functions as fn

spark.range(1, 1000).repartition(2).cache()
spark.range(1, 1000).groupBy("id").agg(fn.min("id")).cache()
spark.range(1, 1000).cache()
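
One way to check whether a given call starts a job is to run it under its own job group and then ask the status tracker which jobs were recorded for that group. The sketch below assumes an existing SparkSession is passed in; the helper name `jobs_triggered`, the `settle_seconds` parameter, and the group-name prefix are made up for illustration. It relies on PySpark's `SparkContext.setJobGroup` and `StatusTracker.getJobIdsForGroup`. Results may differ between Spark versions, so treat it as a diagnostic aid rather than a statement of what any particular release does.

```python
import time
import uuid


def jobs_triggered(spark, action, settle_seconds=0.5):
    """Run `action` under a fresh job group and return the IDs of the jobs it started.

    `spark` is an existing SparkSession and `action` is a zero-argument callable.
    Both names are illustrative and not part of any Spark API.
    """
    sc = spark.sparkContext
    # A unique group name keeps jobs from earlier calls out of the result.
    group = "cache-check-" + uuid.uuid4().hex
    # Note: the job group stays set on this thread after the helper returns.
    sc.setJobGroup(group, "jobs started while calling cache()")
    action()
    # The status tracker is fed asynchronously, so wait briefly before reading it.
    time.sleep(settle_seconds)
    return sorted(sc.statusTracker().getJobIdsForGroup(group))
```

With a session named `spark`, calling `jobs_triggered(spark, lambda: spark.range(1, 1000).orderBy("id").cache())` should return a non-empty list on a version that shows the behaviour reported here, while the same call with `repartition(2)` in place of `orderBy("id")` should return an empty list.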

People

Assignee: Unassigned

Reporter: Richard Liebscher

Votes: 0

Watchers: 3

Dates

Created: 27/Mar/17 09:33

Updated: 27/Mar/17 12:03

Resolved: 27/Mar/17 12:03