Hadoop YARN / YARN-6289

Failed to achieve data locality when running MapReduce and Spark on HDFS


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: distributed-scheduling
    • Labels: None

    Description

      When running a simple wordcount experiment on YARN, I noticed that the tasks failed to achieve data locality, even though no other job was running on the cluster at the same time. The experiment was done on a 7-node cluster (1 master, 6 data nodes/node managers), and the input of the wordcount job (both Spark and MapReduce) was a single-block file in HDFS replicated two ways (replication factor = 2). I ran wordcount on YARN 10 times. The results show that only 30% of the tasks achieved data locality, which looks like the result of random task placement. The experiment details are in the attachments; feel free to reproduce the experiments.
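
      To see where the block replicas actually live (and compare that against the nodes the map tasks or Spark executors were scheduled on), the replica hosts can be queried from the NameNode with the standard HDFS FileSystem API. The following is only a minimal sketch; the input path and class name are placeholders and should be adapted to the cluster under test.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BlockLocationCheck {
          public static void main(String[] args) throws Exception {
              // Path to the single-block wordcount input file (placeholder, adjust to your cluster).
              Path input = new Path(args.length > 0 ? args[0] : "/user/hadoop/wordcount/input.txt");

              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              FileStatus status = fs.getFileStatus(input);

              // Ask the NameNode which datanodes hold each replica of each block of the file.
              BlockLocation[] locations = fs.getFileBlockLocations(status, 0, status.getLen());
              for (BlockLocation loc : locations) {
                  System.out.println("block offset=" + loc.getOffset() + " length=" + loc.getLength());
                  for (String host : loc.getHosts()) {
                      System.out.println("  replica on: " + host);
                  }
              }
              fs.close();
          }
      }

      With replication factor 2, only 2 of the 6 data nodes hold a replica of the single block, so a node-local assignment is possible but not guaranteed; the printout above makes it easy to tell whether a given run was node-local, rack-local, or off-rack.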

      Attachments

        1. Hadoop_Spark_Conf.zip (38 kB, Huangkaixuan)
        2. YARN-DataLocality.docx (197 kB, Huangkaixuan)
        3. YARN-RackAwareness.docx (118 kB, Huangkaixuan)

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Huangkaixuan (Huangkx6810)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:
