[SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: PySpark
    • Labels: None

    Description

      Since the data is serialized on the Python side, there's not much point in keeping it as byte arrays in Java, or even in skipping compression. We should make cache() in PySpark use MEMORY_ONLY_SER and turn on spark.rdd.compress for it.
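The rationale above can be illustrated with a small, self-contained sketch. It uses `pickle` and `zlib` as stand-ins for PySpark's serializer and Spark's block compression codec (the actual codec is governed by `spark.io.compression.codec`); it is not PySpark's implementation, just a demonstration that serialized Python records are already opaque bytes to the JVM and tend to compress well:

```python
import pickle
import zlib

# Sample records, similar in spirit to what a PySpark RDD partition holds.
data = [{"id": i, "name": f"user_{i}"} for i in range(10_000)]

# PySpark ships data to the JVM in serialized form, so the JVM only ever
# sees a blob of bytes -- deserialized (MEMORY_ONLY) storage buys nothing.
raw = pickle.dumps(data)

# What spark.rdd.compress adds on top: compressing the serialized blocks.
compressed = zlib.compress(raw)

print(f"serialized: {len(raw)} bytes, compressed: {len(compressed)} bytes")
assert len(compressed) < len(raw)
```

Since repetitive record structures like these shrink substantially under compression, turning on `spark.rdd.compress` for Python RDDs trades a little CPU for a meaningful reduction in cached-block size.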


People

    Assignee: prashant (Prashant Sharma)
    Reporter: matei (Matei Alexandru Zaharia)
    Votes: 0
    Watchers: 1

Dates

    Created:
    Updated:
    Resolved: