Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Description
Running multiLogReg on one day of the Criteo dataset (~65GB, 192215183 x 40) initially showed good performance, but after a while individual Spark jobs became significantly slower. The underlying issue is incorrect tracking of live broadcast objects (and their sizes). These sizes are taken into account when deciding whether to collect the output of an RDD operation directly, or to write it to HDFS and then read it back in, which avoids the double memory requirement (list of blocks and target matrix).
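To make the decision described above concrete, here is a minimal sketch of how tracked sizes could feed into the collect-vs-HDFS choice. It is not the actual implementation: the names `TrackedSizes` and `collect_directly`, the exact form of the memory check, the 8-bytes-per-cell output estimate, and the 20 GiB budget are all assumptions made for illustration. Only the two broadcast sizes are taken from the trace in this issue.

```python
from dataclasses import dataclass


@dataclass
class TrackedSizes:
    """Hypothetical bookkeeping of memory held by live Spark objects."""
    parallelized_rdd_bytes: int = 0
    broadcast_bytes: int = 0


def collect_directly(output_bytes: int, tracked: TrackedSizes, budget_bytes: int) -> bool:
    """Return True if the RDD output may be collected, False if it should go via HDFS.

    Assumption: collecting needs the list of blocks and the target matrix at the
    same time (the double memory requirement), on top of whatever the tracker
    believes is still held by parallelized RDDs and broadcasts.
    """
    required = 2 * output_bytes + tracked.parallelized_rdd_bytes + tracked.broadcast_bytes
    return required <= budget_bytes


# Assumed dense size of a 192215183 x 1 vector of doubles.
OUTPUT_BYTES = 192215183 * 8
# Hypothetical memory budget of 20 GiB.
BUDGET_BYTES = 20 * 1024**3

early = TrackedSizes(broadcast_bytes=1546178968)   # first tracked value in the trace
late = TrackedSizes(broadcast_bytes=89678400892)   # last tracked value in the trace

print(collect_directly(OUTPUT_BYTES, early, BUDGET_BYTES))  # True: collect directly
print(collect_directly(OUTPUT_BYTES, late, BUDGET_BYTES))   # False: falls back to HDFS
```

Under these assumptions, a tracked broadcast size that only ever grows eventually pushes every output of this shape onto the slower HDFS path, even if the broadcasts it counts are no longer live.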
Here is the beginning of an annotated trace that shows the collected intermediates together with the currently tracked parallelized RDD sizes and broadcast sizes, where the latter increase over iterations:
multiLogReg: matrix X contains 3.1180385E8 missing values, replacing with 0.
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 1546178968
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 1546178968
-- Initially: Objective = 1.3323341215726392E8, Gradient Norm = 1.3370025117036335E15, Trust Delta = 8.96687446468921E-8
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 4638537632
-- Outer Iteration 1: Had 1 CG iterations, trust bound REACHED
-- Obj.Reduction: Actual = 7.376044631418988E7, Predicted = 6.79357235016733E7 (A/P: 1.0857), Trust Delta = 1.1652798264371577E-7
-- New Objective = 5.947296584307404E7, Beta Change Norm = 8.96687446468921E-8, Gradient Norm = 4.205610284123732E14
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 9277075628
-- Outer Iteration 2: Had 2 CG iterations
-- Obj.Reduction: Actual = 2.0680665323246337E7, Predicted = 1.7237411570907284E7 (A/P: 1.1998), Trust Delta = 1.3498247047656392E-7
-- New Objective = 3.87923005198277E7, Beta Change Norm = 1.0801908934149486E-7, Gradient Norm = 1.2537458077726923E14
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 13915613624
-- Outer Iteration 3: Had 2 CG iterations
-- Obj.Reduction: Actual = 3856414.333940871, Predicted = 3280015.9111615047 (A/P: 1.1757), Trust Delta = 1.3498247047656392E-7
-- New Objective = 3.493588618588683E7, Beta Change Norm = 6.80215369514393E-8, Gradient Norm = 3.0650099929442824E13
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 18554151620
-- Outer Iteration 4: Had 2 CG iterations
-- Obj.Reduction: Actual = 454099.7561884448, Predicted = 412109.7441715886 (A/P: 1.1019), Trust Delta = 1.3498247047656392E-7
-- New Objective = 3.4481786429698385E7, Beta Change Norm = 3.836587046446831E-8, Gradient Norm = 4.523909094264681E12
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 24738868948
-- Outer Iteration 5: Had 3 CG iterations, trust bound REACHED
-- Obj.Reduction: Actual = 208441.3064660579, Predicted = 204927.98343597393 (A/P: 1.0171), Trust Delta = 1.8934146325940216E-7
-- New Objective = 3.427334512323233E7, Beta Change Norm = 1.3498247047656392E-7, Gradient Norm = 2.7783321485314087E12
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 32469765608
-- Outer Iteration 6: Had 4 CG iterations, trust bound REACHED
-- Obj.Reduction: Actual = 165939.54292698205, Predicted = 164350.2297838717 (A/P: 1.0097), Trust Delta = 2.589771486748884E-7
-- New Objective = 3.4107405580305345E7, Beta Change Norm = 1.8934146325940216E-7, Gradient Norm = 4.0411007607393413E12
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 40200662268
-- Outer Iteration 7: Had 4 CG iterations, trust bound REACHED
-- Obj.Reduction: Actual = 272763.50560534, Predicted = 273275.77875587903 (A/P: 0.9981), Trust Delta = 4.931622371939089E-7
-- New Objective = 3.3834642074700005E7, Beta Change Norm = 2.589771486748884E-7, Gradient Norm = 2.412446015995407E12
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 47931558928
-- Outer Iteration 8: Had 4 CG iterations, trust bound REACHED
-- Obj.Reduction: Actual = 298278.3447513506, Predicted = 290429.90745707706 (A/P: 1.027), Trust Delta = 5.224497917600619E-7
-- New Objective = 3.3536363729948655E7, Beta Change Norm = 4.931622371939088E-7, Gradient Norm = 4.1054855722029614E12
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 57208634920
-- Outer Iteration 9: Had 5 CG iterations, trust bound REACHED
-- Obj.Reduction: Actual = 481602.88423537463, Predicted = 488455.4207090684 (A/P: 0.986), Trust Delta = 6.743092577663683E-7
-- New Objective = 3.305476084571328E7, Beta Change Norm = 5.224497917600618E-7, Gradient Norm = 6.068325237011659E12
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 68031890244
-- Outer Iteration 10: Had 6 CG iterations, trust bound REACHED
-- Obj.Reduction: Actual = 450117.9838358946, Predicted = 437852.75430859264 (A/P: 1.028), Trust Delta = 9.526909409859216E-7
-- New Objective = 3.2604642861877386E7, Beta Change Norm = 6.743092577663683E-7, Gradient Norm = 5.060731543199763E12
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 78855145568
-- Outer Iteration 11: Had 6 CG iterations, trust bound REACHED
-- Obj.Reduction: Actual = 654956.862544518, Predicted = 655726.9758760328 (A/P: 0.9988), Trust Delta = 1.9711345764520718E-6
-- New Objective = 3.1949685999332868E7, Beta Change Norm = 9.526909409859214E-7, Gradient Norm = 2.9454194152650522E12
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 89678400892
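The growth of the tracked broadcast size is easier to see when the two trailing numbers of each "Check write RDD" entry are extracted and differenced. The following is a small sketch for doing that. It assumes that the first trailing number is the tracked parallelized RDD size and the second is the tracked broadcast size, as suggested by the description above. The helper names `tracked_sizes` and `broadcast_growth` are made up for this example, and the sample holds only the first few entries of the trace, one per line.

```python
import re

# Matches the trailing "<parallelized RDD bytes> <broadcast bytes>" pair of a
# "Check write RDD" entry (column meaning is an assumption, see above).
CHECK_LINE = re.compile(r"^Check write RDD: \[.*\] (\d+) (\d+)$")


def tracked_sizes(trace: str) -> list[tuple[int, int]]:
    """Extract (rdd_bytes, broadcast_bytes) from every matching trace line."""
    sizes = []
    for line in trace.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match:
            sizes.append((int(match.group(1)), int(match.group(2))))
    return sizes


def broadcast_growth(sizes: list[tuple[int, int]]) -> list[int]:
    """Difference between consecutive tracked broadcast sizes."""
    broadcasts = [broadcast for _, broadcast in sizes]
    return [later - earlier for earlier, later in zip(broadcasts, broadcasts[1:])]


SAMPLE = """\
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 1546178968
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 1546178968
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 4638537632
-- Outer Iteration 1: Had 1 CG iterations, trust bound REACHED
Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 9277075628
"""

sizes = tracked_sizes(SAMPLE)
growth = broadcast_growth(sizes)
print(growth)  # [0, 3092358664, 4638537996]

# Express each increase in units of the first tracked broadcast size.
unit = sizes[0][1]
print([round(delta / unit) for delta in growth])  # [0, 2, 3]
```

On this sample, the increases are close to whole multiples of the first tracked broadcast size (about 1.5 GB), which would be consistent with additional same-sized broadcasts being counted over time and never subtracted again.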