[SPARK-20106] Nonlazy caching of DataFrame after orderBy/sort

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 2.0.1, 2.1.0
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      Calling cache or persist on a DataFrame after a call to orderBy or sort is not lazy: it immediately triggers a Spark job:

      spark.range(1, 1000).orderBy("id").cache()

      Other operations do not generate a job when cached:

      from pyspark.sql import functions as fn  # needed for fn.min below

      spark.range(1, 1000).repartition(2).cache()
      spark.range(1, 1000).groupBy("id").agg(fn.min("id")).cache()
      spark.range(1, 1000).cache()
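
      One way to check the behaviour described above without watching the Spark UI is to run each cache() call under its own job group and count the jobs recorded for that group. The following is a minimal sketch, not part of the original report: the helper name count_jobs, the group ids, and the local[2] master are assumptions chosen for illustration. The counts it prints may differ between Spark versions, so none are shown here.

```python
def count_jobs(sc, group_id, action):
    """Run `action` under job group `group_id` and return how many
    Spark jobs were recorded for that group afterwards.

    `sc` is expected to behave like a pyspark SparkContext
    (setJobGroup and statusTracker().getJobIdsForGroup are used).
    """
    sc.setJobGroup(group_id, "lazy-cache check")
    action()
    return len(sc.statusTracker().getJobIdsForGroup(group_id))


def demo():
    """Hypothetical driver: compares the cache() calls from the report."""
    # Imported here so that count_jobs can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    sc = spark.sparkContext

    cases = {
        "orderBy": lambda: spark.range(1, 1000).orderBy("id").cache(),
        "repartition": lambda: spark.range(1, 1000).repartition(2).cache(),
        "plain": lambda: spark.range(1, 1000).cache(),
    }
    for name, action in cases.items():
        # A count of 0 means cache() stayed lazy for that plan.
        print(name, count_jobs(sc, "check-" + name, action))

    spark.stop()
```

      Calling demo() in a Python session with pyspark available prints one job count per case. A separate job group per case keeps the counts from mixing, since the status tracker reports all jobs it still knows about for a group.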

People

    • Assignee: Unassigned
    • Reporter: Richard Liebscher (richardtt)