SPARK-677: PySpark should not collect results through local filesystem


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0
    • Fix Version/s: 1.2.2, 1.3.1, 1.4.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance.

      Instead, we should stream the data from Java to Python over a local socket or a FIFO.
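      For illustration only (this is not Spark's actual implementation): a minimal Python sketch of the socket-based approach, where a background thread stands in for the JVM side and serves length-prefixed pickled records over a loopback socket, and the Python side streams them back without ever touching the local filesystem. The helper names serve_results() and load_from_socket() are hypothetical.

      import socket
      import struct
      import threading
      import pickle

      def serve_results(results):
          """Bind an ephemeral loopback port, then stream length-prefixed
          pickled records to the first client that connects.
          (Illustrative stand-in for the JVM side; not a Spark API.)"""
          server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          server.bind(("127.0.0.1", 0))   # ephemeral port, loopback only
          server.listen(1)
          port = server.getsockname()[1]

          def _serve():
              conn, _ = server.accept()
              with conn:
                  for record in results:
                      payload = pickle.dumps(record)
                      conn.sendall(struct.pack("!i", len(payload)) + payload)
                  conn.sendall(struct.pack("!i", -1))  # end-of-stream marker
              server.close()

          threading.Thread(target=_serve, daemon=True).start()
          return port

      def load_from_socket(port):
          """Connect to the serving port and yield deserialized records,
          streaming instead of reading a temp file back from disk."""
          sock = socket.create_connection(("127.0.0.1", port))
          with sock, sock.makefile("rb") as infile:
              while True:
                  (length,) = struct.unpack("!i", infile.read(4))
                  if length < 0:           # end-of-stream marker
                      break
                  yield pickle.loads(infile.read(length))

      if __name__ == "__main__":
          port = serve_results(range(10))
          print(list(load_from_socket(port)))   # [0, 1, ..., 9]

      Binding to 127.0.0.1 with an ephemeral port keeps the transfer local to the machine, and the length-prefixed framing lets the reader consume records incrementally rather than buffering the whole result or spilling it to disk.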


            People

            • Assignee: Davies Liu (davies)
            • Reporter: Josh Rosen (joshrosen)
            • Votes: 1
            • Watchers: 6
