[SPARK-38037] Spark MLlib FPGrowth not working with 40+ items in Frequent Item set - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.0
Fix Version/s: None
Component/s: ML
Labels: None
Environment: Standalone Linux server, 32 GB RAM, 4 cores

Description

We have been using Spark FPGrowth, and it works well with millions of transactions (records) when a frequent itemset contains fewer than 25 items. Beyond 25 items it runs into a computational limit, and with 40 or more items in a frequent itemset the process never returns.

To reproduce, create a simple data set of 3 transactions containing the same items (40 of them) and run FPGrowth with a minimum support of 0.9; the process never completes. Below is the sample data I used to narrow down the problem (one transaction per row):

I10 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24 I25 I26 I27 I28 I29 I30 I31 I32 I33 I34 I35 I36 I37 I38 I39 I40

I10 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24 I25 I26 I27 I28 I29 I30 I31 I32 I33 I34 I35 I36 I37 I38 I39 I40

I10 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24 I25 I26 I27 I28 I29 I30 I31 I32 I33 I34 I35 I36 I37 I38 I39 I40
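The reproduction described above could be sketched roughly as follows. This is only an illustration, not the reporter's actual code: it assumes PySpark and the DataFrame-based `pyspark.ml.fpm.FPGrowth` API, and the helper names `build_transactions` and `run_fpgrowth`, the column names, and the `local[*]` master are hypothetical choices made for this sketch.

```python
def build_transactions(first=10, last=40, copies=3):
    """Return `copies` identical transactions holding items I<first>..I<last>."""
    items = [f"I{i}" for i in range(first, last + 1)]
    return [list(items) for _ in range(copies)]


def run_fpgrowth(transactions, min_support=0.9):
    """Fit FPGrowth on the transactions and return the frequent-itemsets DataFrame.

    Assumption: the DataFrame-based API is used; the issue does not say
    which API or language the reporter ran.
    """
    # Imported lazily so the data-building helper works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.fpm import FPGrowth

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("fpgrowth-repro")
        .getOrCreate()
    )
    df = spark.createDataFrame(
        [(i, t) for i, t in enumerate(transactions)], ["id", "items"]
    )
    model = FPGrowth(itemsCol="items", minSupport=min_support).fit(df)
    return model.freqItemsets


# Three identical transactions, matching the sample rows listed above.
transactions = build_transactions()
```

With Spark available, `run_fpgrowth(transactions).count()` would be the call expected to run for a very long time. Note that the sample rows as listed run from I10 to I40, which is 31 distinct items; a larger `last` value (hypothetical item names) would be needed to reach the 40 items mentioned in the description.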

While the computation grows exponentially (2^n - 1 itemsets for n items in a frequent itemset), it surely should be able to handle 40 or more items in a frequent itemset.
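To put the 2^n - 1 figure in perspective: when every transaction contains the same n items and the minimum support is at most 1, every non-empty subset of those items is frequent, so the result itself has 2^n - 1 rows. A small sketch of that arithmetic (the function name is hypothetical):

```python
def frequent_itemset_count(n):
    """Number of non-empty subsets of n items, i.e. 2^n - 1."""
    return 2 ** n - 1


# 25 items (where the report says things still work), 31 items (the sample
# rows as listed), and 40 items (the case named in the title).
for n in (25, 31, 40):
    print(n, frequent_itemset_count(n))
# prints:
# 25 33554431
# 31 2147483647
# 40 1099511627775
```

At 40 items this is over a trillion itemsets to enumerate, which would explain a job that appears never to return regardless of how few transactions there are.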

Is this a limitation of the FPGrowth implementation, or are there tuning parameters that I am missing? Thank you.

People

Assignee: Unassigned

Reporter: RJ

Votes: 0

Watchers: 2

Dates

Created: 26/Jan/22 14:52

Updated: 11/Feb/22 03:44