Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Not A Problem
- Affects Version/s: 1.2.0
- None
- None
- Environment: Linux, Spark Standalone 1.2, running in a PBS grid engine
Description
The block manager kept fetching the same blocks over and over, making tasks with network activity extremely slow. Two identical tasks can take anywhere from 12 seconds to more than an hour (at which point I stopped the job).
Spark should cache the blocks locally so that it does not fetch the same blocks repeatedly.
Here is a simplified version of the code that provokes it:
    // Read a few thousand lines (~ 15 MB)
    val fileContents = sc.newAPIHadoopFile(path, ......).repartition(16)
    val data = fileContents.map { x => parseContent(x) }.cache()

    // Do a pairwise comparison and count the best pairs
    val pairs = data.cartesian(data).filter { case (x, y) => similarity(x, y) > 0.9 }
    pairs.count()
This is a tiny fraction of one of the worker's stderr:
    15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
    15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
    15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_1 remotely
    15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_0 remotely
    [Thousands more lines, fetching the same 16 remote blocks]
    15/03/12 22:25:44 INFO BlockManager: Found block rdd_8_0 remotely
    15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
    15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
    15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
    15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
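One plausible reading of this log, offered as an assumption rather than something this report confirms: if the cartesian product is evaluated as a nested loop that re-opens the right-hand partition for every element of the left-hand partition, then a right-hand block held by another executor is fetched once per left-hand element. The plain-Scala sketch below models only that access pattern, with no Spark dependency. `CartesianFetchModel`, `fetchRemoteBlock`, and the `fetches` counter are hypothetical names introduced for illustration and are not Spark APIs.

```scala
// Plain-Scala model of a nested-loop cartesian product; no Spark involved.
// ASSUMPTION: `fetchRemoteBlock` is a hypothetical stand-in for reading a
// cached block that lives on another executor.
object CartesianFetchModel {
  // Number of simulated remote reads since the last reset.
  var fetches = 0

  def fetchRemoteBlock(block: Vector[Int]): Iterator[Int] = {
    fetches += 1
    block.iterator
  }

  // Re-opens the right-hand block for every left-hand element.
  def naive(left: Vector[Int], right: Vector[Int]): Vector[(Int, Int)] =
    (for (x <- left.iterator; y <- fetchRemoteBlock(right)) yield (x, y)).toVector

  // Reads the right-hand block once and reuses the local copy.
  def fetchOnce(left: Vector[Int], right: Vector[Int]): Vector[(Int, Int)] = {
    val local = fetchRemoteBlock(right).toVector
    (for (x <- left.iterator; y <- local.iterator) yield (x, y)).toVector
  }

  // Resets the counter, runs one variant, and reports how many reads it made.
  def countFetches(run: => Any): Int = {
    fetches = 0
    run
    fetches
  }
}
```

In this model both variants produce the same pairs, but `naive` performs one simulated fetch per left-hand element while `fetchOnce` performs exactly one. Whether this matches what Spark 1.2 does internally would need to be checked against the Spark source.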
Details for that stage from the UI.
- Total task time across all tasks: 11.9 h
- Input: 2.2 GB
- Shuffle read: 4.5 MB
Summary Metrics for 176 Completed Tasks
Metric | Min | 25th percentile | Median | 75th percentile | Max |
---|---|---|---|---|---|
Duration | 7 s | 8 s | 8 s | 12 s | 59 min |
GC Time | 0 ms | 99 ms | 0.1 s | 0.2 s | 0.5 s |
Input | 6.9 MB | 8.2 MB | 8.4 MB | 9.0 MB | 11.0 MB |
Shuffle Read (Remote) | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 676.6 KB |
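As a rough sanity check on the stage's Input figure: a cartesian product of two 16-partition RDDs has 16 × 16 = 256 partitions, so 256 tasks. The sketch below multiplies that by the median per-task Input from the table above. Treating every task as reading the median amount is a simplifying assumption, and `InputEstimate` is a hypothetical name used only for this illustration.

```scala
// Back-of-the-envelope estimate using figures quoted in this report.
// ASSUMPTION: every task reads roughly the median per-task Input.
object InputEstimate {
  val partitions = 16          // from repartition(16) in the simplified code
  val medianInputMb = 8.4      // median per-task Input from the summary table

  // cartesian pairs every partition of one RDD with every partition of the other
  val cartesianTasks: Int = partitions * partitions

  // Estimated total Input for the stage, in MB
  val estimatedInputMb: Double = cartesianTasks * medianInputMb
}
```

The estimate comes to about 2150 MB, which is in the same range as the 2.2 GB reported for the stage. That suggests the large Input total follows from the number of cartesian tasks, each reading cached data, rather than from the size of the source file.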
Aggregated Metrics by Executor
Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks | Input | Output | Shuffle Read | Shuffle Write | Shuffle Spill (Memory) | Shuffle Spill (Disk) |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | n-62-23-3:49566 | 5.7 h | 9 | 0 | 9 | 171.0 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
1 | n-62-23-6:57518 | 16.4 h | 20 | 0 | 20 | 169.9 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
2 | n-62-18-48:33551 | 0 ms | 0 | 0 | 0 | 169.6 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
3 | n-62-23-5:58421 | 2.9 min | 12 | 0 | 12 | 266.2 MB | 0.0 B | 4.5 MB | 0.0 B | 0.0 B | 0.0 B |
4 | n-62-23-1:40096 | 23 min | 164 | 0 | 164 | 1430.4 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
Tasks
Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input | Shuffle Read | Errors |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 0.1 s | 6.9 MB (memory) | 676.6 KB | |
0 | 1 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 0.3 s | 8.7 MB (network) | 0.0 B | |
4 | 5 | 0 | SUCCESS | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 38 min | 0.4 s | 8.6 MB (network) | 0.0 B | |
3 | 4 | 0 | RUNNING | ANY | 2 / n-62-18-48 | 2015/03/12 21:55:00 | 55 min | | 8.3 MB (network) | 0.0 B | |
2 | 3 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 11 s | 0.3 s | 8.4 MB (memory) | 0.0 B | |
7 | 8 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 12 s | 0.3 s | 9.2 MB (memory) | 0.0 B | |
6 | 7 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 0.1 s | 8.1 MB (memory) | 0.0 B | |
5 | 6 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 0.3 s | 8.6 MB (network) | 0.0 B | |
9 | 10 | 0 | RUNNING | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 55 min | | 8.7 MB (network) | 0.0 B | |
Issue Links
- is duplicated by: SPARK-6922 RDD.cartesian is much slower than join (Resolved)
- links to