SystemDS / SYSTEMDS-1011

Slow sparse append/cbind (sparse row re-allocations)


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: SystemML 0.11
    • Component/s: None
    • Labels: None

    Description

      All algorithms that support the 'intercept' option (e.g., LinregCG, LinregDS, L2SVM, MSVM, Mlogreg, and GLM) append a column of ones to the feature matrix at the beginning of the script (roughly X = cbind(X, matrix(1, rows=nrow(X), cols=1)) in DML). On large sparse data, this append sometimes dominates end-to-end performance. For example, here are the LinRegCG end-to-end runtimes in seconds for a 10Mx1K scenario with sparsity 0.01:

      -- Running runLinearRegCG on 10M_1k_sparse (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
      -- Running runLinearRegCG on 10M_1k_sparse (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
      -- Running runLinearRegCG on 10M_1k_sparse (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 6
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 16
      -- Running runLinearRegCG on 10M_1k_sparse (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 16
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
      

      With an intercept (ict=1/2), the runtime roughly doubles compared to ict=0. Here is the related -stats output for ict=1:

      Total elapsed time:		16.893 sec.
      Total compilation time:		2.412 sec.
      Total execution time:		14.480 sec.
      Number of compiled Spark inst:	0.
      Number of executed Spark inst:	0.
      Cache hits (Mem, WB, FS, HDFS):	172/0/0/2.
      Cache writes (WB, FS, HDFS):	77/0/1.
      Cache times (ACQr/m, RLS, EXP):	1.734/0.003/2.143/0.209 sec.
      HOP DAGs recompiled (PRED, SB):	0/0.
      HOP DAGs recompile time:	0.000 sec.
      Spark ctx create time (lazy):	0.000 sec.
      Spark trans counts (par,bc,col):0/0/0.
      Spark trans times (par,bc,col):	0.000/0.000/0.000 secs.
      Total JIT compile time:		5.357 sec.
      Total JVM GC count:		2.
      Total JVM GC time:		5.628 sec.
      Heavy hitter instructions (name, time, count):
      -- 1) 	append 	8.595 sec 	26
      -- 2) 	mmchain 	4.443 sec 	8
      -- 3) 	ba+* 	0.537 sec 	10
      -- 4) 	r' 	0.411 sec 	10
      -- 5) 	write 	0.210 sec 	1
      -- 6) 	- 	0.087 sec 	20
      -- 7) 	uak+ 	0.059 sec 	2
      -- 8) 	tsmm 	0.049 sec 	11
      -- 9) 	rand 	0.043 sec 	5
      -- 10) 	+* 	0.007 sec 	24
      

      The large GC time indicates that sparse row re-allocations are the major bottleneck here: each append grows the output rows incrementally, repeatedly copying and discarding intermediate arrays. We should instead compute the joint nnz per output row upfront and allocate each output sparse row just once, as sketched below.
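
      A minimal sketch of the proposed allocation strategy, in plain Java. SimpleSparseRow is a hypothetical, simplified stand-in for SystemML's internal sparse row representation (parallel arrays of column indexes and values); only the contrast between incremental growth and one-shot allocation is the point, not the actual API.

      // Hypothetical simplified sparse row: parallel arrays of column indexes and values.
      class SimpleSparseRow {
          int size;        // current number of non-zeros
          int[] indexes;   // column indexes of the non-zeros
          double[] values; // values of the non-zeros

          SimpleSparseRow(int capacity) {
              indexes = new int[capacity];
              values  = new double[capacity];
          }

          // Append one non-zero, doubling the arrays when full (incremental growth).
          void append(int col, double val) {
              if (size == indexes.length) {
                  // re-allocation: copies everything appended so far; old arrays become garbage
                  indexes = java.util.Arrays.copyOf(indexes, Math.max(2 * size, 4));
                  values  = java.util.Arrays.copyOf(values,  Math.max(2 * size, 4));
              }
              indexes[size] = col;
              values[size]  = val;
              size++;
          }
      }

      class SparseCbindSketch {
          // Problematic pattern: grow the output row value by value.
          static SimpleSparseRow cbindIncremental(SimpleSparseRow l, SimpleSparseRow r, int lcols) {
              SimpleSparseRow out = new SimpleSparseRow(4); // starts small, re-allocates repeatedly
              for (int i = 0; i < l.size; i++)
                  out.append(l.indexes[i], l.values[i]);
              for (int i = 0; i < r.size; i++)
                  out.append(lcols + r.indexes[i], r.values[i]); // shift rhs column indexes
              return out;
          }

          // Proposed fix: compute the joint nnz and allocate the output row exactly once.
          static SimpleSparseRow cbindPreallocated(SimpleSparseRow l, SimpleSparseRow r, int lcols) {
              SimpleSparseRow out = new SimpleSparseRow(l.size + r.size); // exact joint nnz
              System.arraycopy(l.indexes, 0, out.indexes, 0, l.size);
              System.arraycopy(l.values,  0, out.values,  0, l.size);
              for (int i = 0; i < r.size; i++)
                  out.indexes[l.size + i] = lcols + r.indexes[i]; // shift rhs column indexes
              System.arraycopy(r.values, 0, out.values, l.size, r.size);
              out.size = l.size + r.size;
              return out;
          }
      }

      For the 10Mx1K scenario with sparsity 0.01 (about 10 non-zeros per row across 10M rows), the preallocated variant performs exactly one allocation per output row, whereas the incremental variant triggers multiple grow-and-copy steps per row whose discarded intermediates must be garbage-collected, consistent with the large GC time above.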

            People

              Assignee: Matthias Boehm (mboehm7)
              Reporter: Matthias Boehm (mboehm7)
