Spark / SPARK-3211

.take() is OOM-prone when there are empty partitions


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.2
    • Fix Version/s: 1.1.1, 1.2.0
    • Component/s: Spark Core
    • Labels: None

    Description

      Filed on dev@ on 22 August by pnepywoda:

      On line 777
      https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771
      the logic for take() reads ALL partitions if the first one (or first k) is
      empty. This has actually led to OOMs when we had many partitions
      (thousands) and unfortunately the first one was empty.

      Wouldn't a better implementation strategy be

      numPartsToTry = partsScanned * 2

      instead of

      numPartsToTry = totalParts - 1

      (this doubling is similar to most memory allocation strategies)

      Thanks!

      • Paul
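
      The doubling strategy proposed above can be sketched as follows. This is
      an illustrative standalone sketch, not the actual RDD.take implementation;
      the object and method names are hypothetical, and each pass of the while
      loop stands in for one Spark job over a window of partitions:

```scala
object TakeSketch {
  // Simulate take(num) over a sequence of partitions, growing the scan
  // window by doubling (partsScanned * 2) instead of jumping straight to
  // totalParts - 1 when the first partitions turn out to be empty.
  // Returns the collected elements and the number of passes (jobs) used.
  def takeWithDoubling[T](partitions: Seq[Seq[T]], num: Int): (Seq[T], Int) = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[T]
    val totalParts = partitions.length
    var partsScanned = 0
    var passes = 0
    while (buf.size < num && partsScanned < totalParts) {
      // Proposed strategy: scan at most partsScanned * 2 more partitions
      // (at least 1 on the first pass), bounding work per pass.
      val numPartsToTry = math.max(partsScanned * 2, 1)
      val upTo = math.min(partsScanned + numPartsToTry, totalParts)
      for (p <- partsScanned until upTo if buf.size < num) {
        buf ++= partitions(p).take(num - buf.size)
      }
      partsScanned = upTo
      passes += 1
    }
    (buf.toSeq, passes)
  }
}
```

      With this growth rule the number of passes is logarithmic in the number
      of leading empty partitions, so a single empty first partition no longer
      triggers a scan of all remaining partitions at once.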

          People

            Assignee: aash Andrew Ash
            Reporter: aash Andrew Ash
            Votes: 0
            Watchers: 4
