Spark / SPARK-28850

Binary Files RDD allocates false number of threads

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels:

      Description

      When making a call to:

      sc.binaryFiles(somePath)

       

      It creates a BinaryFileRDD. Some sections of that code run inside the driver container. The current source code for BinaryFileRDD is [available here|https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala].

      The problematic line is:

       

      conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, Runtime.getRuntime.availableProcessors().toString)
      

       

      This line sets the number of threads to use (when reading is multi-threaded) to the number of cores (including hyper-threaded ones) available on the driver host machine.

      This number is wrong, because what really matters is the number of cores that YARN allocates to the driver container, not the number of cores available on the host machine. This can easily degrade the Spark UI and the driver application's performance: the number of threads is far larger than the number of allocated cores, which causes unnecessary preemptions and context switches.
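
To make the effect of `setIfUnset` concrete, here is a minimal sketch in plain Scala. It is a simplified model, not Spark's or Hadoop's code: the `Map` stands in for the Hadoop configuration, and `ListStatusThreads`, `setIfUnset` and `withDefaultThreads` are hypothetical names used only for illustration. The key string is the value of `FileInputFormat.LIST_STATUS_NUM_THREADS`.

```scala
object ListStatusThreads {
  // Key behind FileInputFormat.LIST_STATUS_NUM_THREADS.
  val Key = "mapreduce.input.fileinputformat.list-status.num-threads"

  // Simplified stand-in for Configuration.setIfUnset:
  // the value is written only when the key has no value yet.
  def setIfUnset(conf: Map[String, String], key: String, value: String): Map[String, String] =
    if (conf.contains(key)) conf else conf + (key -> value)

  // Models the line quoted above: default to the host's processor count.
  def withDefaultThreads(conf: Map[String, String], hostProcessors: Int): Map[String, String] =
    setIfUnset(conf, Key, hostProcessors.toString)

  def main(args: Array[String]): Unit = {
    val hostProcessors = Runtime.getRuntime.availableProcessors()
    // Nothing preset: the host-wide processor count becomes the thread count.
    println(withDefaultThreads(Map.empty, hostProcessors)(Key))
    // Preset to a hypothetical allocation of 2 cores: the preset value is kept.
    println(withDefaultThreads(Map(Key -> "2"), hostProcessors)(Key))
  }
}
```

If the configuration that reaches BinaryFileRDD is derived from `sc.hadoopConfiguration` (an assumption worth checking against the linked source), setting this key there before calling `sc.binaryFiles` may serve as a workaround, since an existing value is not overwritten.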

      The solution is to retrieve the number of cores allocated to the Application Master by YARN instead.
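
A rough sketch of what such a lookup could look like, in plain Scala. It assumes the allocation is visible as a setting named `spark.driver.cores`; that is a real Spark property, but whether it matches what YARN actually granted to the container is not confirmed here. `DriverThreads` and `listStatusThreads` are hypothetical names, and the `Map` stands in for the application's configuration.

```scala
import scala.util.Try

object DriverThreads {
  // Assumption: the container's core allocation is exposed under this key.
  val AllocatedCoresKey = "spark.driver.cores"

  // Picks the thread count from the allocated cores when a valid positive
  // value is present, otherwise falls back to the host processor count.
  def listStatusThreads(settings: Map[String, String], hostProcessors: Int): Int = {
    val allocated: Option[Int] = settings
      .get(AllocatedCoresKey)
      .flatMap(v => Try(v.trim.toInt).toOption) // ignore non-numeric values
      .filter(_ > 0)                            // ignore zero or negative values
    // Never use more threads than the host has processors.
    allocated.map(cores => math.min(cores, hostProcessors)).getOrElse(hostProcessors)
  }

  def main(args: Array[String]): Unit = {
    val host = Runtime.getRuntime.availableProcessors()
    println(listStatusThreads(Map(AllocatedCoresKey -> "2"), host))
    println(listStatusThreads(Map.empty, host))
  }
}
```

Falling back to the host processor count keeps the current behaviour whenever no allocation can be determined, so the change would only take effect where the allocation is known.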

      Once the problem is confirmed, I can work on retrieving that information and opening a PR.

            People

            • Assignee: Unassigned
            • Reporter: Marco Lotz (marcolotz)
            • Votes: 1
            • Watchers: 2

                Time Tracking

                • Original Estimate: 24h
                • Remaining Estimate: 24h
                • Time Spent: Not Specified