Description
1. The example from the docs
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

val result = pca.transform(df).select("pcaFeatures")
result.show(false)
The output shows:
+-----------------------------------------------------------+
|pcaFeatures                                                |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+
2. Change the vector format
I changed the first vector from "Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))" to the equivalent dense form "Vectors.dense(0.0,1.0,0.0,7.0,0.0)".
But now the output shows:
+------------------------------------------------------------+
|pcaFeatures                                                 |
+------------------------------------------------------------+
|[1.6485728230883814,-4.0132827005162985,-1.0091435193998504]|
|[-4.645104331781533,-1.1167972663619048,-1.0091435193998501]|
|[-6.428880535676488,-5.337951427775359,-1.009143519399851]  |
+------------------------------------------------------------+
It is strange that the two outputs are inconsistent, since both inputs represent the same data. Why?
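To narrow the question down, here is a minimal plain-Scala sketch (no Spark required) that compares the two printed outputs column by column. The numbers are copied from the outputs above; the names `sparseRun`, `denseRun`, `maxDiffPerColumn`, and `spread` are made up for this illustration.

```scala
// Output of the run whose first row was built with Vectors.sparse.
val sparseRun = Array(
  Array(1.6485728230883807, -4.013282700516296, -5.524543751369388),
  Array(-4.645104331781534, -1.1167972663619026, -5.524543751369387),
  Array(-6.428880535676489, -5.337951427775355, -5.524543751369389)
)

// Output of the run whose first row was built with Vectors.dense.
val denseRun = Array(
  Array(1.6485728230883814, -4.0132827005162985, -1.0091435193998504),
  Array(-4.645104331781533, -1.1167972663619048, -1.0091435193998501),
  Array(-6.428880535676488, -5.337951427775359, -1.009143519399851)
)

// Largest absolute difference between the two runs, per output column.
val maxDiffPerColumn: Seq[Double] = (0 until 3).map { j =>
  sparseRun.indices.map(i => math.abs(sparseRun(i)(j) - denseRun(i)(j))).max
}

// Spread (max - min) of one column within a single run.
def spread(run: Array[Array[Double]], j: Int): Double = {
  val col = run.map(_(j))
  col.max - col.min
}

maxDiffPerColumn.zipWithIndex.foreach { case (d, j) =>
  println(f"column $j: max |sparse - dense| = $d%.3e")
}
println(f"spread of column 2 (sparse run) = ${spread(sparseRun, 2)}%.3e")
println(f"spread of column 2 (dense run)  = ${spread(denseRun, 2)}%.3e")
```

The first two columns agree to within floating-point rounding; only the third differs, and within each run the third column is practically constant. With only three rows, the centered data has rank at most 2, so the third component may correspond to a numerically zero eigenvalue whose direction is not uniquely determined. This is a guess based on the numbers above, not a confirmed cause.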
Thanks.