SPARK-12635: More efficient (column batch) serialization for Python/R

Details

• Type: New Feature
• Status: In Progress
• Priority: Major
• Resolution: Unresolved
• Affects Version/s: None
• Fix Version/s: None
• Component/s: PySpark, SparkR, SQL
• Labels: None

Description

Serialization between Scala, Python, and R is slow. Python and R both work well with a column-batch interface (e.g. NumPy arrays). Technically we should be able to pass column batches around with minimal serialization (perhaps even zero-copy memory).

Note that this depends on some internal refactoring to use a column batch interface in Spark SQL.
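To illustrate the cost difference the description is pointing at, here is a minimal Python sketch (stdlib only; it is not Spark's actual serializer). It contrasts per-row pickling with shipping a whole column as one contiguous buffer that can be reinterpreted without copying:

```python
import array
import pickle

# A column of 1000 doubles.
rows = [float(i) for i in range(1000)]

# Row-at-a-time serialization: each value is pickled separately,
# paying per-element overhead (analogous to row-based transfer).
row_wise_size = sum(len(pickle.dumps(v)) for v in rows)

# Column-batch serialization: the whole column travels as one
# contiguous buffer -- a single memcpy for the entire batch.
col = array.array("d", rows)
batch_bytes = col.tobytes()          # 1000 * 8 = 8000 bytes

# Deserialization on the receiving side can be a zero-copy view:
# the bytes are reinterpreted as doubles without copying them.
view = memoryview(batch_bytes).cast("d")

assert list(view) == rows
assert len(batch_bytes) < row_wise_size
print(len(batch_bytes), row_wise_size)
```

The batch encoding is both smaller and cheaper to decode, since the receiver only reinterprets a buffer instead of running a deserializer per value; the same idea, applied across the JVM boundary, is what a column-batch interface would enable.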


People

• Assignee: Unassigned
• Reporter: rxin (Reynold Xin)
• Votes: 6
• Watchers: 22
