[SPARK-29814] Missing persist on sources in mllib.feature.PCA - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.4.3
Fix Version/s: None
Component/s: MLlib
Labels: None

Description

The RDD sources is used by more than one action: first(), and then the actions triggered inside computePrincipalComponentsAndExplainedVariance(). Without persistence, Spark recomputes its lineage for each action, so it should be persisted.

  def fit(sources: RDD[Vector]): PCAModel = {
    // First use of sources: the first() action runs a job on the RDD.
    val numFeatures = sources.first().size
    require(k <= numFeatures,
      s"source vector size $numFeatures must be no less than k=$k")
    require(PCAUtil.memoryCost(k, numFeatures) < Int.MaxValue,
      "The param k and numFeatures is too large for SVD computation. " +
      "Try reducing the parameter k for PCA, or reduce the input feature " +
      "vector dimension to make this tractable.")

    val mat = new RowMatrix(sources)
    // Second use of sources: the RowMatrix computation runs further jobs on the same RDD.
    val (pc, explainedVariance) = mat.computePrincipalComponentsAndExplainedVariance(k)
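
One possible caller-side workaround is sketched below. It is not the change proposed in the linked pull request, and the names PersistedPca and fitWithPersist are hypothetical, introduced only for illustration. The sketch assumes the input fits the MEMORY_AND_DISK storage level and that the caller wants the cache released once fitting is done.

```scala
import org.apache.spark.mllib.feature.{PCA, PCAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark.
object PersistedPca {
  // Caches `sources` around PCA.fit so that first() and the later
  // RowMatrix jobs can reuse the same cached partitions.
  def fitWithPersist(sources: RDD[Vector], k: Int): PCAModel = {
    // Leave the RDD alone if the caller already chose a storage level.
    val needsPersist = sources.getStorageLevel == StorageLevel.NONE
    if (needsPersist) sources.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      new PCA(k).fit(sources)
    } finally {
      // Release the cache only if this helper created it.
      if (needsPersist) sources.unpersist()
    }
  }
}
```

Checking getStorageLevel first avoids overriding or dropping a cache that the caller set up deliberately.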

Issue Links

duplicates: SPARK-29818 Missing persist on RDD (Resolved)

links to: GitHub Pull Request #26451

People

Assignee: Unassigned
Reporter: IcySanwitch
Votes: 0
Watchers: 1

Dates

Created: 09/Nov/19 13:02
Updated: 10/Nov/19 19:22
Resolved: 10/Nov/19 19:19