Spark / SPARK-6381

Add Apriori algorithm to MLlib


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

      Description

      Xiangrui Meng
      There are many algorithms for association rule mining, for example FPGrowth and Apriori. These are classic
      machine learning algorithms and are very useful in big data mining. The FPGrowth implementation in Spark 1.3
      can handle very large data sets, but it has to build an FP-tree before it can mine frequent itemsets. So when
      the transaction data is small and sparse and minSupport is large, we can choose the Apriori algorithm instead.
      How can the Apriori algorithm be parallelized?
      1. Generate the frequent items by filtering the input data using the minimal support level.
      private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: Long, partitioner: Partitioner): Array[Item]
      2. Generate the frequent itemsets by running Apriori; the extraction is done on each partition.
      2.1 Create candidateSet from kFreqItems and k.
      private def createCandidateSet[Item: ClassTag](kFreqItems: Array[(Array[Item], Long)], k: Int)
      2.2 Generate the next kFreqItems by scanning dataSet and counting each itemset in candidateSet.
      private def scanDataSet[Item: ClassTag](dataSet: RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): RDD[(Array[Item], Long)]
      2.3 Filter dataSet by candidateSet.
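
      The steps above can be sketched on plain Scala collections. This is only an illustration of the level-wise Apriori loop, not the proposed RDD implementation: the object name AprioriSketch, the run method and the simplified signatures (Set[String] transactions instead of RDD[Array[Item]], no Partitioner) are assumptions made for the example, and step 2.3 is read here as "drop transactions that contain no candidate".

```scala
object AprioriSketch {
  type Itemset = Set[String]

  // Step 1: keep single items that occur in at least minCount transactions.
  // Transactions are Sets, so an item is counted at most once per transaction.
  def genFreqItems(data: Seq[Itemset], minCount: Long): Map[Itemset, Long] =
    data.flatten
      .groupBy(identity)
      .map { case (item, occurrences) => (Set(item), occurrences.size.toLong) }
      .filter { case (_, count) => count >= minCount }

  // Step 2.1: join frequent (k-1)-itemsets into k-item candidates, then prune
  // every candidate that has an infrequent (k-1)-subset.
  def createCandidateSet(prevFreq: Set[Itemset], k: Int): Set[Itemset] = {
    val joined = for {
      a <- prevFreq
      b <- prevFreq
      u = a | b
      if u.size == k
    } yield u
    joined.filter(c => c.subsets(k - 1).forall(prevFreq.contains))
  }

  // Step 2.2: count each candidate in the data and keep those meeting minCount.
  def scanDataSet(data: Seq[Itemset], candidates: Set[Itemset],
                  minCount: Long): Map[Itemset, Long] =
    candidates.iterator
      .map(c => (c, data.count(t => c.subsetOf(t)).toLong))
      .filter { case (_, count) => count >= minCount }
      .toMap

  // Hypothetical driver: repeat steps 2.1 to 2.3 until no itemset survives.
  def run(data: Seq[Itemset], minCount: Long): Map[Itemset, Long] = {
    var level = genFreqItems(data, minCount)
    var all = level
    var remaining = data
    var k = 2
    while (level.nonEmpty) {
      val candidates = createCandidateSet(level.keySet, k)
      // Step 2.3 (assumed meaning): a transaction holding no k-item candidate
      // cannot hold any larger frequent itemset, so it can be dropped.
      remaining = remaining.filter(t => candidates.exists(_.subsetOf(t)))
      level = scanDataSet(remaining, candidates, minCount)
      all ++= level
      k += 1
    }
    all
  }
}
```

      In a distributed version, the counting in scanDataSet is the part that would run per partition over the RDD, while candidate generation works on the small collected result of the previous level.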

            People

            • Assignee: Unassigned
            • Reporter: zhangyouhua

              Dates

              • Created:
                Updated:
                Resolved:
