Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.0.2
- Fix Version/s: None
- Component/s: None
Description
When running a Spark Streaming job, receiver blocks should be replicated across different nodes, but all the blocks are replicated to the same node. Here is the log:
14/09/10 19:55:16 INFO BlockManagerInfo: Added input-0-1410350117000 in memory on 10.196.131.19:42261 (size: 8.9 MB, free: 1050.3 MB)
14/09/10 19:55:16 INFO BlockManagerInfo: Added input-0-1410350117000 in memory on tdw-10-196-130-155:51155 (size: 8.9 MB, free: 879.3 MB)
14/09/10 19:55:17 INFO BlockManagerInfo: Added input-0-1410350118000 in memory on 10.196.131.19:42261 (size: 7.7 MB, free: 1042.6 MB)
14/09/10 19:55:17 INFO BlockManagerInfo: Added input-0-1410350118000 in memory on tdw-10-196-130-155:51155 (size: 7.7 MB, free: 871.6 MB)
14/09/10 19:55:18 INFO BlockManagerInfo: Added input-0-1410350119000 in memory on 10.196.131.19:42261 (size: 7.3 MB, free: 1035.3 MB)
14/09/10 19:55:18 INFO BlockManagerInfo: Added input-0-1410350119000 in memory on tdw-10-196-130-155:51155 (size: 7.3 MB, free: 864.3 MB)
The reason is that when a BlockManagerSlave asks the BlockManagerMaster for peers to replicate to, the master always returns the same BlockManagerIds: the peer list ordering is stable and the selection is deterministic, so every block from the same block manager goes to the same node. Here is the code:
private def getPeers(blockManagerId: BlockManagerId, size: Int): Seq[BlockManagerId] = {
  val peers: Array[BlockManagerId] = blockManagerInfo.keySet.toArray
  val selfIndex = peers.indexOf(blockManagerId)
  if (selfIndex == -1) {
    throw new SparkException("Self index for " + blockManagerId + " not found")
  }
  // Note that this logic will select the same node multiple times if there aren't enough peers
  Array.tabulate[BlockManagerId](size) { i => peers((selfIndex + i + 1) % peers.length) }.toSeq
}
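To see why the same peers keep being chosen, here is a small self-contained sketch of the same modular selection over a hypothetical three-node peer list (the node names are made up for illustration); the result depends only on the stable peer ordering, so it never varies between calls:

object PeerSelectionDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical peer list; in Spark these would be BlockManagerIds.
    val peers = Array("nodeA", "nodeB", "nodeC")
    val selfIndex = peers.indexOf("nodeA")
    // Same tabulate logic as getPeers above: walk the array starting
    // just after selfIndex, wrapping around modulo the peer count.
    val targets = Array.tabulate(2) { i => peers((selfIndex + i + 1) % peers.length) }.toSeq
    // Always prints the same two peers (nodeB, nodeC); nothing varies
    // between calls, so block after block replicates to the same nodes.
    println(targets)
  }
}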
I think the BlockManagerMaster should instead return size BlockManagerIds chosen from the peers with the most remaining memory.
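A minimal sketch of what that could look like, assuming blockManagerInfo maps each BlockManagerId to an info object exposing its remaining memory (as the master actor's BlockManagerInfo does); this illustrates the idea, it is not a tested patch:

private def getPeers(blockManagerId: BlockManagerId, size: Int): Seq[BlockManagerId] = {
  // Exclude the requester itself, then prefer the peers with the most
  // remaining memory so replicas land on the least loaded nodes.
  blockManagerInfo
    .filterKeys(_ != blockManagerId)
    .toSeq
    .sortBy { case (_, info) => -info.remainingMem }
    .take(size)
    .map(_._1)
}

Unlike the round-robin version, this never selects the same node twice; if fewer than size peers exist it simply returns fewer, which the replication caller would need to tolerate.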
Issue Links
- duplicates SPARK-3495: Block replication fails continuously when the replication target node is dead (Resolved)