SPARK-16921

RDD/DataFrame persist() and cache() should return Python context managers

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: PySpark, Spark Core, SQL
    • Labels:

      Description

      Context managers are a natural way to capture closely related setup and teardown code in Python.

      For example, they are commonly used when doing file I/O:

      with open('/path/to/file') as f:
          contents = f.read()
          ...
      

      Once the program exits the with block, f is automatically closed.

      I think it makes sense to apply this pattern to persisting and unpersisting DataFrames and RDDs. There are many cases where you want to persist a DataFrame for a specific set of operations and then unpersist it immediately afterwards.

      For example, take model training. Today, you might do something like this:

      labeled_data.persist()
      model = pipeline.fit(labeled_data)
      labeled_data.unpersist()
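
      One wrinkle with the imperative version: if fit() raises, unpersist() never runs and the data stays cached. Being robust to that today requires an explicit try/finally, sketched here:

      labeled_data.persist()
      try:
          model = pipeline.fit(labeled_data)
      finally:
          # Release the cache even if fit() raises.
          labeled_data.unpersist()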
      

      If persist() returned a context manager, you could rewrite this as follows:

      with labeled_data.persist():
          model = pipeline.fit(labeled_data)
      

      Upon exiting the with block, labeled_data would automatically be unpersisted.

      This can be done in a backwards-compatible way: persist() would still return the parent DataFrame or RDD as it does today, but the returned object would also implement two new methods, __enter__() and __exit__().
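
      A minimal sketch of what those two methods could look like (illustrative only, not actual Spark source; the mixin name here is hypothetical):

      class PersistContextMixin(object):
          """Hypothetical mixin giving RDD/DataFrame context-manager support."""

          def __enter__(self):
              # persist() has already run and returned this object;
              # return self so `with df.persist() as df:` also works.
              return self

          def __exit__(self, exc_type, exc_value, traceback):
              # Always unpersist on exit, even if the block raised.
              self.unpersist()
              # Returning False lets any exception from the block propagate.
              return False

      A side benefit: because __exit__() runs even when the block raises, the persist/unpersist pairing becomes exception-safe without the explicit try/finally shown earlier.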

    People

    • Assignee: Unassigned
    • Reporter: Nicholas Chammas (nchammas)
    • Votes: 1
    • Watchers: 5
