  1. Spark
  2. SPARK-35423

The output of PCA is inconsistent

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1.1
    • Fix Version/s: 3.2.0
    • Component/s: MLlib
    • Labels: None
    • Environment: Spark Version: 3.1.1

    Description

      1. The example from the documentation

       

      import org.apache.spark.ml.feature.PCA
      import org.apache.spark.ml.linalg.Vectors
      
      val data = Array(
        Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
        Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
        Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
      )
      val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
      
      val pca = new PCA()
        .setInputCol("features")
        .setOutputCol("pcaFeatures")
        .setK(3)
        .fit(df)
      
      val result = pca.transform(df).select("pcaFeatures")
      result.show(false)
      

       

       

      The output shows:

      +-----------------------------------------------------------+
      |pcaFeatures                                                |
      +-----------------------------------------------------------+
      |[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
      |[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
      |[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
      +-----------------------------------------------------------+
      

      2. Change the vector format

      I changed the first row from "Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))" to the equivalent dense form "Vectors.dense(0.0,1.0,0.0,7.0,0.0)".

      With that change, the output is:

      +------------------------------------------------------------+
      |pcaFeatures                                                 |
      +------------------------------------------------------------+
      |[1.6485728230883814,-4.0132827005162985,-1.0091435193998504]|
      |[-4.645104331781533,-1.1167972663619048,-1.0091435193998501]|
      |[-6.428880535676488,-5.337951427775359,-1.009143519399851]  |
      +------------------------------------------------------------+
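
For reference, the change in step 2 can be sketched as below. The names `sparseRow`, `denseRow`, `sameValues`, and `dataDense` are illustrative and not part of the original report. The check only confirms that both vectors hold the same five values, so any difference in the PCA output would come from the storage format rather than from the data.

```scala
import org.apache.spark.ml.linalg.Vectors

// First row as written in step 1 (sparse) and in step 2 (dense).
val sparseRow = Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))
val denseRow  = Vectors.dense(0.0, 1.0, 0.0, 7.0, 0.0)

// Compare the expanded values element by element.
val sameValues = sparseRow.toArray.sameElements(denseRow.toArray)
println(s"same values: $sameValues")

// Input for step 2: identical to step 1 except for the first row's format.
val dataDense = Array(
  denseRow,
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
```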
      

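      The two inputs hold the same values, yet the outputs are inconsistent: the first two components agree, but the third is about -5.52 in the first run and about -1.01 in the second. Why?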
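
One way to narrow this down is to fit both variants side by side and inspect the fitted models. This is a minimal sketch, not a diagnosis. `fitPca`, `mixedRows`, and `denseRows` are hypothetical names, and the local `SparkSession` setup is an assumption (in spark-shell the existing `spark` would be picked up by `getOrCreate`). In both outputs above the third component is the same for every row, and with only three input rows the centered data has rank at most two, so it may be worth checking whether the third explained variance is close to zero. Whether that accounts for the difference is an assumption to verify.

```scala
import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this check.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pca-format-check")
  .getOrCreate()

// Hypothetical helper: fit PCA with k = 3, as in the report.
def fitPca(rows: Array[Vector]): PCAModel = {
  val df = spark.createDataFrame(rows.map(Tuple1.apply)).toDF("features")
  new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(3)
    .fit(df)
}

// Step 1 input: first row sparse, remaining rows dense.
val mixedRows = Array[Vector](
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)

// Step 2 input: same values, all rows dense.
val denseRows = Array[Vector](
  Vectors.dense(0.0, 1.0, 0.0, 7.0, 0.0),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)

val mixedModel = fitPca(mixedRows)
val denseModel = fitPca(denseRows)

// Share of variance captured by each of the three components.
println(s"explained variance (mixed): ${mixedModel.explainedVariance}")
println(s"explained variance (dense): ${denseModel.explainedVariance}")

// Principal-component matrices (5 x 3); compare them column by column.
println(s"pc (mixed):\n${mixedModel.pc}")
println(s"pc (dense):\n${denseModel.pc}")
```

If the first two columns of `pc` match and only the third differs, that would point at how the near-degenerate third direction is resolved for each input format.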

      Thanks.

       

          People

            Assignee: shahid
            Reporter: cqfrog
            Votes: 0
            Watchers: 3
