Spark / SPARK-30580

Why can PySpark persist data only in serialised format?


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      The storage levels in PySpark allow persisting data only in serialised format. There is also a comment explicitly stating that "Since the data is always serialized on the Python side, all the constants use the serialized formats." While that makes total sense for RDDs, it is not clear to me why it is not possible to persist data without serialisation when using the DataFrame/Dataset APIs. In theory, in such cases, the persist call would only be a directive and the data would never leave the JVM, thus allowing for un-serialised persistence, correct? Many thanks for the feedback!
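      To make the observation concrete, here is a minimal self-contained sketch of how PySpark's storage-level constants are laid out (modelled on `pyspark.storagelevel.StorageLevel` in Spark 2.4; the field names follow that module, but treat this as an illustration rather than the library source). The point is that every memory-backed constant on the Python side carries `deserialized=False`, whereas the Scala API also offers deserialised variants such as `MEMORY_ONLY` with `deserialized=true`:

```python
from collections import namedtuple

# Sketch of PySpark's StorageLevel flags (assumption: mirrors the shape of
# pyspark.storagelevel.StorageLevel; not imported from Spark itself).
StorageLevel = namedtuple(
    "StorageLevel",
    ["useDisk", "useMemory", "useOffHeap", "deserialized", "replication"],
)

# On the Python side the memory-backed constants all set deserialized=False,
# because records cross the Py4J boundary as serialized bytes.
MEMORY_ONLY = StorageLevel(False, True, False, False, 1)
MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)

# Usage in a real session would look like (requires a running SparkSession,
# shown here only as a comment):
#   df.persist(StorageLevel.MEMORY_ONLY)

print(MEMORY_ONLY.deserialized)      # False
print(MEMORY_AND_DISK.useDisk)       # True
```

      For DataFrames the persist call is handled inside the JVM, so in principle the Python-side `deserialized` flag need not constrain how the JVM caches the rows, which is the crux of the question above.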


          People

            Assignee: Unassigned
            Reporter: Francesco Cavrini
            Votes: 0
            Watchers: 1
