It would be nice to enhance the existing svec function in org.apache.mahout.math.scalabindings as follows:
    /**
     * Create a sparse vector out of a list of Tuple2's.
     * @param sdata (index, value) pairs
     * @param cardinality size of the vector; if negative, it is inferred as the largest index + 1
     * @return the sparse vector
     */
    def svec(sdata: TraversableOnce[(Int, AnyVal)], cardinality: Int = -1) = {
      // Smallest cardinality that can hold every index in sdata
      val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
      var tmp = -1
      if (cardinality < 0) {
        tmp = required
      } else if (cardinality < required) {
        throw new IllegalArgumentException(s"Required cardinality $required but got $cardinality")
      } else {
        tmp = cardinality
      }
      val initialCapacity = sdata.size
      val sv = new RandomAccessSparseVector(tmp, initialCapacity)
      sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
      sv
    }
This lets the user specify the cardinality of the created sparse vector.
This is useful and convenient when the user wants to create a DRM from many sparse vectors that do not have the same actual size but do have the same logical size (e.g. the rows of a sparse matrix).
The code below demonstrates the case:
    val cardinality = 20
    val rdd = sc.textFile("/some/file.txt")
      .map(_.split(","))
      .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
      .reduceByKey((v1, v2) => v1 ++ v2)
      .map(row => (row._1, svec(row._2, cardinality)))
    val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

    // All of the element-wise operations below fail for a DRM whose
    // sparse vectors do not have a consistent cardinality
    val drm2 = drm + drm.t
    val drm3 = drm - drm.t
    val drm4 = drm * drm.t
    val drm5 = drm / drm.t
Notice that in the last map, svec accepts an additional cardinality parameter, so the created sparse vectors all have a consistent cardinality.
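The effect of the extra parameter can be illustrated without Spark or Mahout. The following is only a minimal sketch: resolveCardinality and CardinalityDemo are hypothetical names, not part of Mahout, and the function merely mirrors the size check in the proposed svec. It shows that two rows with different highest indices get different inferred sizes, while an explicit cardinality makes them agree.

```scala
object CardinalityDemo {
  // Hypothetical stand-in for the size check inside the proposed svec.
  // Returns the vector size that would be used for the given entries.
  def resolveCardinality(entries: Seq[(Int, Double)], cardinality: Int = -1): Int = {
    // Smallest size that can hold every index present in the data.
    val required = if (entries.nonEmpty) entries.map(_._1).max + 1 else 0
    if (cardinality < 0) required // no explicit size: infer it from the data
    else if (cardinality < required)
      throw new IllegalArgumentException(
        s"Required cardinality $required but got $cardinality")
    else cardinality
  }

  def main(args: Array[String]): Unit = {
    val rowA = Seq(0 -> 1.0, 2 -> 1.0) // highest index is 2
    val rowB = Seq(1 -> 1.0, 7 -> 1.0) // highest index is 7

    // Inferred sizes differ between the two rows: 3 and 8.
    println(resolveCardinality(rowA))
    println(resolveCardinality(rowB))

    // With an explicit cardinality, both rows get size 20.
    println(resolveCardinality(rowA, 20))
    println(resolveCardinality(rowB, 20))
  }
}
```

Passing a cardinality smaller than the highest index + 1 (for example 5 for rowB) raises an IllegalArgumentException, which matches the check in the proposed svec.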