Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Abandoned
- Affects Version/s: None
- Fix Version/s: None
- Environment: standalone mode on laptop, and YARN cluster with 10 nodes
Description
I'm testing the following NMF algorithm written using the Python API:
from pyspark.sql import SQLContext
import systemml as sml
from systemml import random

sqlContext = SQLContext(sc)
sml.setSparkContext(sc)

m, n = tfidf.shape
k = 40
V = sml.matrix(tfidf)
W = sml.random.uniform(size=(m, k))
H = sml.random.uniform(size=(k, n))

max_iters = 200
for i in range(max_iters):
    H = H * (W.transpose().dot(V)) / (W.transpose().dot(W.dot(H)))
    W = W * (V.dot(H.transpose())) / (W.dot(H.dot(H.transpose())))

W = W.toNumPyArray()
Here tfidf is a sparse matrix of shape (114720, 11590).
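For reference, the same multiplicative update rules can be sketched in pure NumPy as a single-machine baseline for timing comparisons. This is only a sketch: the dimensions below are small stand-ins for the real tfidf matrix, the random data and the epsilon guard against division by zero are my additions, and it does not use SystemML at all.

```python
import numpy as np

# Pure-NumPy sketch of the same multiplicative NMF updates.
# Small random matrices stand in for the (114720, 11590) tfidf matrix.
rng = np.random.default_rng(0)
m, n, k = 200, 150, 10
V = rng.random((m, n))   # stand-in for tfidf (nonnegative entries)
W = rng.random((m, k))
H = rng.random((k, n))

err0 = np.linalg.norm(V - W.dot(H))  # initial reconstruction error
eps = 1e-9                           # assumption: guard against divide-by-zero
for _ in range(50):
    H = H * (W.T.dot(V)) / (W.T.dot(W.dot(H)) + eps)
    W = W * (V.dot(H.T)) / (W.dot(H.dot(H.T)) + eps)
err = np.linalg.norm(V - W.dot(H))   # error decreases under these updates
```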
The evaluation of W takes more than one hour when running on a laptop. On the YARN cluster, it didn't finish within 1.5 hours (I killed the job).
If I evaluate the H matrix instead, it takes only about 2 minutes.
Note that calling eval before evaluating W makes no difference: W still takes about an hour.