  1. Spark
  2. SPARK-38037

Spark MLlib FPGrowth not working with 40+ items in Frequent Item set

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None
    • Environment: Standalone Linux server, 32 GB RAM, 4 cores

    Description

      We have been using Spark FPGrowth, and it works well with millions of transactions (records) as long as the frequent itemsets contain fewer than 25 items. Beyond 25 items it runs into a computational limit. With 40 or more items in a frequent itemset, the process never returns.

      To reproduce, create a simple data set of 3 transactions that each contain the same 40 items, and run FPGrowth with a minimum support of 0.9. The process never completes. Below is the sample data I used to narrow down the problem:

      I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24 I25 I26 I27 I28 I29 I30 I31 I32 I33 I34 I35 I36 I37 I38 I39 I40
      I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24 I25 I26 I27 I28 I29 I30 I31 I32 I33 I34 I35 I36 I37 I38 I39 I40
      I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24 I25 I26 I27 I28 I29 I30 I31 I32 I33 I34 I35 I36 I37 I38 I39 I40
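
      For anyone who wants to try this, here is a minimal PySpark sketch of the reproduction. It assumes the DataFrame-based `pyspark.ml.fpm.FPGrowth` API and a local Spark session; the original report does not say which API or language was used, and the helper names `build_transactions` and `count_frequent_itemsets` are made up for illustration.

```python
def build_transactions(n_items=40, n_transactions=3):
    """Return n_transactions identical baskets holding items I1..I<n_items>."""
    items = [f"I{i}" for i in range(1, n_items + 1)]
    return [list(items) for _ in range(n_transactions)]


def count_frequent_itemsets(n_items, min_support=0.9):
    """Run FPGrowth on the synthetic data and count the frequent itemsets.

    pyspark is imported inside the function so that the data builder above
    can be used without Spark installed.
    """
    from pyspark.sql import SparkSession
    from pyspark.ml.fpm import FPGrowth

    # Assumption: a local session; the report used a standalone Linux server.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("fpgrowth-repro")
        .getOrCreate()
    )
    rows = list(enumerate(build_transactions(n_items)))
    df = spark.createDataFrame(rows, ["id", "items"])
    model = FPGrowth(itemsCol="items", minSupport=min_support).fit(df)
    return model.freqItemsets.count()
```

      It is probably safest to call `count_frequent_itemsets` with small values first (for example 5, 10, 15) and watch how the run time changes, since `count_frequent_itemsets(40)` is the case that is reported never to finish.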

       

      While the number of itemsets to enumerate grows exponentially (2^n - 1 for n items in a frequent itemset), it surely should be able to handle 40 or more items in a frequent itemset.
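
      To put numbers on the 2^n - 1 growth, here is a small back-of-the-envelope sketch in plain Python. The throughput of one million itemsets per second is a hypothetical figure chosen only for illustration; it is not a measurement of Spark.

```python
def frequent_itemset_count(n_items):
    """Number of non-empty subsets of n_items items: 2^n - 1."""
    return 2 ** n_items - 1


def estimated_hours(n_items, itemsets_per_second=1_000_000):
    """Rough time to emit every itemset at a hypothetical, unmeasured rate."""
    return frequent_itemset_count(n_items) / itemsets_per_second / 3600


for n in (20, 25, 30, 40):
    print(f"n={n}: {frequent_itemset_count(n):,} itemsets, "
          f"about {estimated_hours(n):.2f} hours at the assumed rate")
```

      For n = 40 the count is 1,099,511,627,775 itemsets, compared with 33,554,431 for n = 25, so under these assumptions the 40-item case involves roughly 32,768 times as much output as the 25-item case.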

       

      Is this a limitation of the FPGrowth implementation, or are there tuning parameters that I am missing? Thank you.

          People

            Assignee: Unassigned
            Reporter: RJ (RJ2022)