  1. Hadoop Common
  2. HADOOP-253

we need speculative execution for reduces

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.2.1
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels: None

      Description

      With my new HTTP-based shuffle (on top of the svn head, including Sameer's parallel fetch), I just finished sorting 2010 g on 200 nodes in 8:49 with 9 reduce failures. However, the amusing part is that the replacement reduces were not the slow ones: 8 of the original reduces were the only things running for the last hour. The job timings looked like:

      Job 0001
      Total:
        Tasks: 16551
        Total: 10056104 secs
        Average: 607 secs
        Worst: task_0001_r_000291_0
        Worst time: 31050 secs
        Best: task_0001_m_013597_0
        Best time: 20 secs
      Maps:
        Tasks: 16151
        Total: 2762635 secs
        Average: 171 secs
        Worst: task_0001_m_002290_0
        Worst time: 2663 secs
        Best: task_0001_m_013597_0
        Best time: 20 secs
      Reduces:
        Tasks: 400
        Total: 7293469 secs
        Average: 18233 secs
        Worst: task_0001_r_000291_0
        Worst time: 31050 secs
        Best: task_0001_r_000263_1
        Best time: 5591 secs

      And the number of tasks run per node was very uneven:

      #tasks node
      124 node1161
      117 node1307
      117 node1124
      116 node1253
      114 node1310
      111 node1302
      111 node1299
      111 node1298
      111 node1249
      111 node1221
      110 node1288
      110 node1286
      110 node1211
      109 node1268
      108 node1292
      108 node1202
      108 node1200
      107 node1313
      107 node1277
      107 node1246
      107 node1242
      107 node1231
      107 node1214
      106 node1243
      105 node1251
      105 node1212
      105 node1205
      104 node1272
      104 node1269
      104 node1210
      104 node1203
      104 node1193
      104 node1128
      103 node1300
      103 node1285
      103 node1279
      103 node1209
      103 node1173
      103 node1165
      102 node1276
      102 node1239
      102 node1228
      102 node1204
      102 node1188
      101 node1314
      101 node1303
      100 node1301
      100 node1252
      99 node1287
      99 node1213
      99 node1206
      98 node1295
      98 node1186
      97 node1293
      97 node1265
      97 node1262
      97 node1260
      97 node1258
      97 node1235
      97 node1229
      97 node1226
      97 node1215
      97 node1208
      97 node1187
      97 node1175
      97 node1171
      96 node1291
      96 node1248
      96 node1224
      96 node1216
      95 node1305
      95 node1280
      95 node1263
      95 node1254
      95 node1153
      95 node1115
      94 node1271
      94 node1261
      94 node1234
      94 node1233
      94 node1227
      94 node1225
      94 node1217
      94 node1142
      93 node1275
      93 node1198
      93 node1107
      92 node1266
      92 node1220
      92 node1219
      91 node1309
      91 node1289
      91 node1270
      91 node1259
      91 node1256
      91 node1232
      91 node1179
      89 node1290
      89 node1255
      89 node1247
      89 node1207
      89 node1201
      89 node1190
      89 node1154
      89 node1141
      88 node1306
      88 node1282
      88 node1250
      88 node1222
      88 node1184
      88 node1149
      88 node1117
      87 node1278
      87 node1257
      87 node1191
      87 node1185
      87 node1180
      86 node1297
      86 node1178
      85 node1195
      85 node1143
      85 node1112
      84 node1281
      84 node1274
      84 node1264
      83 node1296
      83 node1148
      82 node1218
      82 node1168
      82 node1167
      81 node1311
      81 node1240
      81 node1223
      81 node1196
      81 node1164
      81 node1116
      80 node1267
      80 node1230
      80 node1177
      80 node1119
      79 node1294
      79 node1199
      79 node1181
      79 node1170
      79 node1166
      79 node1103
      78 node1244
      78 node1189
      78 node1157
      77 node1304
      77 node1172
      74 node1182
      71 node1160
      71 node1147
      68 node1236
      68 node1183
      67 node1245
      59 node1139
      58 node1312
      57 node1162
      56 node1308
      56 node1197
      55 node1146
      54 node1106
      53 node1111
      53 node1105
      49 node1145
      49 node1123
      48 node1176
      46 node1136
      44 node1132
      44 node1125
      44 node1122
      44 node1108
      43 node1192
      43 node1121
      42 node1194
      42 node1138
      42 node1104
      41 node1155
      41 node1126
      41 node1114
      40 node1158
      40 node1151
      40 node1137
      40 node1110
      40 node1100
      39 node1156
      38 node1140
      38 node1135
      38 node1109
      37 node1144
      37 node1120
      36 node1118
      34 node1133
      34 node1113
      31 node1134
      26 node1127
      23 node1101
      20 node1131

      And it should not surprise us that the last 8 reduces were running on nodes 1134, 1127, 1101, and 1131, the four nodes that ran the fewest tasks. This really demonstrates the need for speculative reduces.

      I propose that we start running speculative reduces once the number of running reduces drops to half the cluster size. I estimate that it would have saved around an hour on this run. Does that sound like a reasonable heuristic?
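
The proposed trigger can be sketched in a few lines. This is only an illustration of the heuristic as stated, not code from Hadoop: the function name and both parameters are hypothetical, and "cluster size" is assumed to mean the number of nodes.

```python
def should_start_speculative_reduces(running_reduces, cluster_nodes):
    """Return True once the running reduces number at most half the cluster size.

    Both arguments are hypothetical inputs; "cluster size" is assumed to be
    the node count. With no reduces running there is nothing to speculate on.
    """
    if running_reduces <= 0:
        return False
    # Integer comparison avoids rounding surprises for odd cluster sizes.
    return 2 * running_reduces <= cluster_nodes


if __name__ == "__main__":
    # The run described above: 200 nodes, 8 reduces left in the last hour.
    print(should_start_speculative_reduces(8, 200))
```

On a 200-node cluster this threshold is reached at 100 running reduces, so the 8 stragglers in this run would have qualified well before the final hour.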

          Activity

          owen.omalley Owen O'Malley added a comment -

          Minor correction, the sort took 8:39 instead of 8:49.

          cutting Doug Cutting added a comment -

          +1

          As before, it must be possible to disable speculative execution for reduces with side-effects. Also, some of the support needs to be in the output format implementations. In particular, these should first write to a temporary name, then rename the output when the task completes, but only if the output does not already exist. We should note that this is the expected behavior for output formats in the interface javadoc, and put as much of the implementation as possible in the base class.
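
The write-then-rename behavior described above can be sketched on a local filesystem as follows. This is a hedged illustration, not the Hadoop output format API: `commit_task_output` is a hypothetical name, and a hard link via `os.link` stands in for a rename that refuses to overwrite an existing file.

```python
import os
import tempfile


def commit_task_output(final_path, data):
    """Write data under a temporary name, then publish it at final_path
    only if no other attempt has already published there.

    Returns True if this attempt's output became the final output.
    Hypothetical helper for illustration; assumes a local filesystem
    that supports hard links.
    """
    directory = os.path.dirname(final_path) or "."
    # The temporary file lives in the same directory as the final output,
    # so the link below never crosses filesystems.
    fd, tmp_path = tempfile.mkstemp(prefix="_tmp_", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        try:
            # os.link raises FileExistsError if final_path already exists,
            # so the first attempt to finish wins and later ones back off.
            os.link(tmp_path, final_path)
            return True
        except FileExistsError:
            return False
    finally:
        # The temporary name is removed whether or not the commit succeeded.
        os.unlink(tmp_path)
```

With two attempts of the same reduce, whichever finishes first publishes its output; the second call returns False and leaves the existing file untouched.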

          sameerp Sameer Paranjpye added a comment -

          Duplicate of HADOOP-76


            People

            • Assignee: owen.omalley Owen O'Malley
            • Reporter: owen.omalley Owen O'Malley