Spark / SPARK-4740

Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey



    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: Shuffle, Spark Core
    • Labels:
    • Target Version/s:


      When testing the current Spark master (1.3.0-SNAPSHOT) with spark-perf (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service takes much longer than the NIO-based one. Netty's network throughput is only about half of NIO's.

      We tested in standalone mode. The data set is 20 billion records, about 400 GB in total. The spark-perf test ran on a 4-node cluster with 10 GbE NICs, 48 CPU cores per node, and 64 GB of memory per executor. The number of reduce tasks is set to 1000.

      Update from Reynold on Dec 15, 2014: the problem is that NIO opens multiple connections between two nodes, whereas Netty opened only one. We introduced a new config option, spark.shuffle.io.numConnectionsPerPeer, to let users explicitly increase the number of connections between two nodes. SPARK-4853 is a follow-up ticket to investigate having Spark set this automatically.
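
      For reference, the new option can be set like any other Spark network config, e.g. in spark-defaults.conf. A minimal sketch; the value 4 is illustrative only, not tuning advice from this ticket:

      ```properties
      # Open 4 TCP connections between each pair of nodes for the
      # Netty shuffle transfer service (default is 1).
      spark.shuffle.io.numConnectionsPerPeer  4

      # Ensure the Netty transfer service is in use
      # (it is the default as of Spark 1.2.0).
      spark.shuffle.blockTransferService      netty
      ```

      The same settings can be passed per-job via --conf on spark-submit or on a SparkConf before the SparkContext is created.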

