SPARK-6728: Improve performance of py4j for large bytearray

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      PySpark relies on py4j to transfer function arguments and return values between Python and the JVM, and passing a large bytearray (larger than 10 MB) through it is very slow.
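
      As a rough illustration of the slowdown, the micro-benchmark sketch below times one round trip through py4j. It assumes a JVM GatewayServer is already running on py4j's default port and exposes a hypothetical entry point with an echo(byte[]) method; both are assumptions for illustration, not part of this issue.

          # Hedged sketch: assumes a running GatewayServer whose entry point
          # exposes a hypothetical echo(byte[]) method.
          import time
          from py4j.java_gateway import JavaGateway

          gateway = JavaGateway()              # connect to the running JVM
          data = bytearray(10 * 1024 * 1024)   # 10 MB, where the slowdown shows

          start = time.time()
          gateway.entry_point.echo(data)       # argument crosses the text protocol
          print("py4j round trip: %.2f s" % (time.time() - start))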

      In MLlib, a Vector can hold more than 100 MB of data; transferring it through py4j may need a few GB of memory and can crash the process.

      The reason is that py4j uses a text protocol: it encodes the bytearray as base64 and performs multiple string concatenations.
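
      Both costs are easy to reproduce outside of py4j; the minimal sketch below shows the ~33% base64 inflation and the repeated-concatenation pattern (the chunk size here is illustrative, not py4j's actual buffer size).

          import base64

          payload = bytearray(10 * 1024 * 1024)   # 10 MB of raw bytes

          # Cost 1: base64 inflates the data by ~33% before it hits the socket.
          encoded = base64.b64encode(payload)
          print(len(payload), len(encoded))       # 10485760 -> 13981016

          # Cost 2: building the outgoing command by repeated concatenation of
          # immutable byte strings can re-copy the accumulated buffer on every
          # step, so total copying grows roughly quadratically with the size.
          chunks = [encoded[i:i + 65536] for i in range(0, len(encoded), 65536)]
          frame = b""
          for chunk in chunks:
              frame += chunk                      # may re-copy all of `frame`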

      A binary protocol would help a lot; an issue has been filed against py4j: https://github.com/bartdag/py4j/issues/159
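
      For comparison, a binary protocol can send the payload raw behind a fixed-size length header, with no re-encoding or concatenation. The framing below is only an illustrative sketch, not py4j's actual wire format.

          import base64
          import struct

          payload = bytes(10 * 1024 * 1024)

          # Text framing: the whole payload is re-encoded as base64 text.
          text_frame = base64.b64encode(payload)            # ~13.3 MB of text

          # Binary framing sketch: 4-byte big-endian length, then raw bytes.
          binary_frame = struct.pack(">I", len(payload)) + payload

          print(len(text_frame), len(binary_frame))         # 13981016 vs 10485764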

    People

      Assignee: Unassigned
      Reporter: Davies Liu (davies)
      Votes: 0
      Watchers: 3

    Dates

      Created:
      Updated:
      Resolved: