SystemDS / SYSTEMDS-2816

Unnecessary overhead due to incorrect spark broadcast cleanup

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: SystemDS 2.1
    • Component/s: None
    • Labels: None

    Description

      Running multiLogReg on one day of the Criteo dataset (~65GB, 192215183 x 40) initially showed good performance, but after a while individual Spark jobs became significantly slower. The underlying issue is incorrect tracking of live broadcast objects (and their sizes). These sizes are taken into account when deciding whether to collect the output of an RDD operation directly, or to write it to HDFS and read it back in, which avoids the double memory requirement (list of blocks and target matrix).
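
The decision described above can be illustrated with a minimal sketch. It is not the SystemDS implementation: the class `DriverMemoryTracker`, its fields, its method names, and the memory budget are all hypothetical, and the only ideas taken from the description are that live broadcast sizes enter the decision and that collecting needs roughly twice the output size. The sketch shows why a broadcast that is destroyed without being removed from the accounting can eventually push every output onto the HDFS path.

```python
from dataclasses import dataclass


@dataclass
class DriverMemoryTracker:
    """Hypothetical bookkeeping of driver memory pinned by live Spark objects."""

    budget_bytes: int            # assumed driver-side memory budget
    parallelized_bytes: int = 0  # tracked size of live parallelized RDDs
    broadcast_bytes: int = 0     # tracked size of live broadcast objects

    def register_broadcast(self, size_bytes: int) -> None:
        self.broadcast_bytes += size_bytes

    def cleanup_broadcast(self, size_bytes: int) -> None:
        # The accounting has to shrink whenever a broadcast is destroyed.
        # Omitting this step makes the tracked size grow without bound.
        self.broadcast_bytes = max(0, self.broadcast_bytes - size_bytes)

    def should_write_to_hdfs(self, output_bytes: int) -> bool:
        # Collecting holds the list of blocks and the target matrix at the
        # same time, hence the factor of two on the output size.
        needed = 2 * output_bytes + self.parallelized_bytes + self.broadcast_bytes
        return needed > self.budget_bytes


if __name__ == "__main__":
    vector_bytes = 1_500_000_000                      # illustrative size only
    tracker = DriverMemoryTracker(budget_bytes=10 * 1024**3)
    for step in range(1, 7):
        tracker.register_broadcast(vector_bytes)      # broadcast is never cleaned up
        print(step, tracker.broadcast_bytes, tracker.should_write_to_hdfs(vector_bytes))
```

With correct cleanup the tracked broadcast size stays bounded and small outputs keep being collected; without it, the check starts returning `True` once the stale entries exceed the assumed budget.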

      Here is the beginning of an annotated trace that shows the collected intermediates together with the currently tracked sizes of parallelized RDDs and broadcasts, where the latter increase over the iterations:

      multiLogReg: matrix X contains 3.1180385E8 missing values, replacing with 0.
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 1546178968
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 1546178968
      -- Initially:  Objective = 1.3323341215726392E8,  Gradient Norm = 1.3370025117036335E15,  Trust Delta = 8.96687446468921E-8
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 4638537632
      -- Outer Iteration 1: Had 1 CG iterations, trust bound REACHED
         -- Obj.Reduction:  Actual = 7.376044631418988E7,  Predicted = 6.79357235016733E7  (A/P: 1.0857),  Trust Delta = 1.1652798264371577E-7
         -- New Objective = 5.947296584307404E7,  Beta Change Norm = 8.96687446468921E-8,  Gradient Norm = 4.205610284123732E14
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 9277075628
      -- Outer Iteration 2: Had 2 CG iterations
         -- Obj.Reduction:  Actual = 2.0680665323246337E7,  Predicted = 1.7237411570907284E7  (A/P: 1.1998),  Trust Delta = 1.3498247047656392E-7
         -- New Objective = 3.87923005198277E7,  Beta Change Norm = 1.0801908934149486E-7,  Gradient Norm = 1.2537458077726923E14
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 13915613624
      -- Outer Iteration 3: Had 2 CG iterations
         -- Obj.Reduction:  Actual = 3856414.333940871,  Predicted = 3280015.9111615047  (A/P: 1.1757),  Trust Delta = 1.3498247047656392E-7
         -- New Objective = 3.493588618588683E7,  Beta Change Norm = 6.80215369514393E-8,  Gradient Norm = 3.0650099929442824E13
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 18554151620
      -- Outer Iteration 4: Had 2 CG iterations
         -- Obj.Reduction:  Actual = 454099.7561884448,  Predicted = 412109.7441715886  (A/P: 1.1019),  Trust Delta = 1.3498247047656392E-7
         -- New Objective = 3.4481786429698385E7,  Beta Change Norm = 3.836587046446831E-8,  Gradient Norm = 4.523909094264681E12
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 24738868948
      -- Outer Iteration 5: Had 3 CG iterations, trust bound REACHED
         -- Obj.Reduction:  Actual = 208441.3064660579,  Predicted = 204927.98343597393  (A/P: 1.0171),  Trust Delta = 1.8934146325940216E-7
         -- New Objective = 3.427334512323233E7,  Beta Change Norm = 1.3498247047656392E-7,  Gradient Norm = 2.7783321485314087E12
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 32469765608
      -- Outer Iteration 6: Had 4 CG iterations, trust bound REACHED
         -- Obj.Reduction:  Actual = 165939.54292698205,  Predicted = 164350.2297838717  (A/P: 1.0097),  Trust Delta = 2.589771486748884E-7
         -- New Objective = 3.4107405580305345E7,  Beta Change Norm = 1.8934146325940216E-7,  Gradient Norm = 4.0411007607393413E12
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 40200662268
      -- Outer Iteration 7: Had 4 CG iterations, trust bound REACHED
         -- Obj.Reduction:  Actual = 272763.50560534,  Predicted = 273275.77875587903  (A/P: 0.9981),  Trust Delta = 4.931622371939089E-7
         -- New Objective = 3.3834642074700005E7,  Beta Change Norm = 2.589771486748884E-7,  Gradient Norm = 2.412446015995407E12
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 47931558928
      -- Outer Iteration 8: Had 4 CG iterations, trust bound REACHED
         -- Obj.Reduction:  Actual = 298278.3447513506,  Predicted = 290429.90745707706  (A/P: 1.027),  Trust Delta = 5.224497917600619E-7
         -- New Objective = 3.3536363729948655E7,  Beta Change Norm = 4.931622371939088E-7,  Gradient Norm = 4.1054855722029614E12
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 57208634920
      -- Outer Iteration 9: Had 5 CG iterations, trust bound REACHED
         -- Obj.Reduction:  Actual = 481602.88423537463,  Predicted = 488455.4207090684  (A/P: 0.986),  Trust Delta = 6.743092577663683E-7
         -- New Objective = 3.305476084571328E7,  Beta Change Norm = 5.224497917600618E-7,  Gradient Norm = 6.068325237011659E12
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 68031890244
      -- Outer Iteration 10: Had 6 CG iterations, trust bound REACHED
         -- Obj.Reduction:  Actual = 450117.9838358946,  Predicted = 437852.75430859264  (A/P: 1.028),  Trust Delta = 9.526909409859216E-7
         -- New Objective = 3.2604642861877386E7,  Beta Change Norm = 6.743092577663683E-7,  Gradient Norm = 5.060731543199763E12
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 78855145568
      -- Outer Iteration 11: Had 6 CG iterations, trust bound REACHED
         -- Obj.Reduction:  Actual = 654956.862544518,  Predicted = 655726.9758760328  (A/P: 0.9988),  Trust Delta = 1.9711345764520718E-6
         -- New Objective = 3.1949685999332868E7,  Beta Change Norm = 9.526909409859214E-7,  Gradient Norm = 2.9454194152650522E12
      Check write RDD: [192215183 x 1, nnz=-1 (false), blocks (1000 x 1000)] 0 89678400892
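
The growth pattern in the trace can be checked with a few lines of arithmetic. The sketch below copies the last column of each "Check write RDD" line (the duplicated first check is used once) and the CG iteration counts reported for outer iterations 1 to 11. Reading the last column as the tracked broadcast size follows the description; the variable names are illustrative, and the interpretation in the comments is an inference from the numbers, not something the trace states.

```python
from functools import reduce
from math import gcd

# Last column of the "Check write RDD" lines, in order of appearance.
tracked_broadcast_bytes = [
    1546178968, 4638537632, 9277075628, 13915613624, 18554151620,
    24738868948, 32469765608, 40200662268, 47931558928, 57208634920,
    68031890244, 78855145568, 89678400892,
]
# CG iterations reported for outer iterations 1 through 11.
cg_iterations = [1, 2, 2, 2, 3, 4, 4, 4, 5, 6, 6]

# Growth of the tracked size between consecutive checks.
deltas = [b - a for a, b in zip(tracked_broadcast_bytes, tracked_broadcast_bytes[1:])]

# Largest size that divides every growth step; a candidate for the size of
# one broadcast entry that is never removed from the accounting.
unit = reduce(gcd, deltas)
multiples = [d // unit for d in deltas]

print("unit bytes:", unit)
for k, (cg, m) in enumerate(zip(cg_iterations, multiples), start=1):
    print(f"outer iteration {k}: {cg} CG iterations, growth = {m} x unit")
```

Every growth step is an exact multiple of 1546179332 bytes, slightly more than the 192215183 x 8 = 1537721464 bytes of one dense 192215183 x 1 vector of doubles, and for each of the eleven outer iterations shown the multiple equals the number of CG iterations plus one. This is consistent with one vector-sized broadcast per inner iteration staying in the tracked total, though the trace alone does not prove that reading.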
      

          People

            Assignee: mboehm7 Matthias Boehm
            Reporter: mboehm7 Matthias Boehm