SPARK-28843: Set OMP_NUM_THREADS to executor cores to reduce Python memory consumption


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.3, 2.4.3, 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Labels:
    • Docs Text:
      Pyspark workers now set the env variable OMP_NUM_THREADS (if not already set) to the number of cores used by an executor (spark.executor.cores). When unset, it defaulted to the total number of VM cores. This avoids excessively large OpenMP thread pools when using, for example, numpy.

      Description

      While testing hardware with more cores, we found that the amount of memory required by PySpark applications increased and tracked the problem to importing numpy. The numpy issue is https://github.com/numpy/numpy/issues/10455

      NumPy uses OpenMP, which starts a thread pool sized to the number of cores on the machine (and does not respect cgroups). When we set this lower, we see a significant reduction in memory consumption.

      This parallelism setting should be set to the number of cores allocated to the executor, not the total number of cores on the machine.
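A minimal sketch of the approach described above: cap OMP_NUM_THREADS before any OpenMP-backed library is imported, since the thread pool is sized once at first import. The value "4" below is a hypothetical stand-in for the executor's core count (spark.executor.cores); in Spark's actual fix the worker environment is populated by the executor, not by user code.

```python
import os

# Cap the OpenMP thread pool before numpy (or any OpenMP-backed library)
# is imported. When unset, OpenMP sizes the pool to the machine's total
# core count, ignoring cgroup limits. setdefault respects a value the
# user has already configured, mirroring the "if not already set" behavior
# described in the docs text above.
os.environ.setdefault("OMP_NUM_THREADS", "4")  # "4" is a hypothetical spark.executor.cores

import numpy  # thread pool is now capped, not sized to all machine cores
```

The ordering matters: setting the variable after numpy has been imported has no effect, because the pool has already been created.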


              People

              • Assignee: Ryan Blue (rdblue)
              • Reporter: Ryan Blue (rdblue)
              • Votes: 0
              • Watchers: 2
