Details

      Description

      The algorithm outputs more patterns that it is needed.

      I have tested Mahout's PFP-Growth algorithm with the http://www.borgelt.net/fpgrowth.html FP-Growth implementation. This implementation has an option to generate closed patterns too.

      When I filtered out the sub patterns from the output of Parallel FP-Growth I arrived to the same result, as in http://www.borgelt.net/fpgrowth.html

      Succinctly, you are not outputting closed items

      I am attaching the dummy DB along with the output of both algorithms

      1. SixTransactions.dat
        0.6 kB
        Yarco Hayduk
      2. dumpedPatterns
        9 kB
        Yarco Hayduk
      3. bresult-new.txt
        0.3 kB
        Yarco Hayduk
      4. patterns-converted.txt
        2 kB
        Yarco Hayduk

        Issue Links

          Activity

          Yarco Hayduk created issue -
          Hide
          Yarco Hayduk added a comment -

          SixTransactions.dat - the dummy DB
          dumpedPatterns - original PFP-Growth output
          patterns-converted.txt - converted dumpedPatterns to a more readable format
          bresult-new.txt - the output of Borgelt's FP-Growth implementation

          Show
          Yarco Hayduk added a comment - SixTransactions.dat - the dummy DB dumpedPatterns - original PFP-Growth output patterns-converted.txt - converted dumpedPatterns to a more readable format bresult-new.txt - the output of Borgelt's FP-Growth implementation
          Yarco Hayduk made changes -
          Field Original Value New Value
          Attachment SixTransactions.dat [ 12480019 ]
          Attachment dumpedPatterns [ 12480020 ]
          Attachment bresult-new.txt [ 12480021 ]
          Attachment patterns-converted.txt [ 12480022 ]
          Sean Owen made changes -
          Assignee Robin Anil [ robinanil ]
          Fix Version/s 0.6 [ 12316364 ]
          Fix Version/s 0.5 [ 12315255 ]
          Affects Version/s 0.5 [ 12315255 ]
          Hide
          jinyongbo added a comment -

          it looks like my problem is not the same as Hayduk.
          my test data is as follow (mahout version 0.4 , minsupport =2)
          1 2 5
          2 3
          4 5
          1 2 3
          and the output as follow
          1 ([2, 1],2)
          2 ([2],3), ([2, 3],2), ([2, 1],2)
          3 ([2, 3],2)
          5 ([5],2)
          Obviously,([1],2),([3],2) is missed

          Show
          jinyongbo added a comment - it looks like my problem is not the same as Hayduk. my test data is as follow (mahout version 0.4 , minsupport =2) 1 2 5 2 3 4 5 1 2 3 and the output as follow 1 ( [2, 1] ,2) 2 ( [2] ,3), ( [2, 3] ,2), ( [2, 1] ,2) 3 ( [2, 3] ,2) 5 ( [5] ,2) Obviously,( [1] ,2),( [3] ,2) is missed
          Hide
          jinyongbo added a comment -

          I add another question.Is the output file in frequentpatterns folder a sequencefile or not?
          Because,i want to save it as a table in Hive.

          Show
          jinyongbo added a comment - I add another question.Is the output file in frequentpatterns folder a sequencefile or not? Because,i want to save it as a table in Hive.
          Hide
          Yarco Hayduk added a comment -

          "Obviously,([1],2),([3],2) is missed"

          PFP-Growth is outputting closed patterns.
          (definition here http://ftp1.de.freebsd.org/Publications/CEUR-WS/Vol-90/liu.pdf )

          Show
          Yarco Hayduk added a comment - "Obviously,( [1] ,2),( [3] ,2) is missed" PFP-Growth is outputting closed patterns. (definition here http://ftp1.de.freebsd.org/Publications/CEUR-WS/Vol-90/liu.pdf )
          Hide
          Lance Norskog added a comment -

          (definition here http://ftp1.de.freebsd.org/Publications/CEUR-WS/Vol-90/liu.pdf )

          The link does not work; the site is up but does not like the path.

          Show
          Lance Norskog added a comment - (definition here http://ftp1.de.freebsd.org/Publications/CEUR-WS/Vol-90/liu.pdf ) The link does not work; the site is up but does not like the path.
          Hide
          Yarco Hayduk added a comment - - edited
          Show
          Yarco Hayduk added a comment - - edited This is the original paper which introduces closed patterns: http://cchen1.csie.ntust.edu.tw:8080/students/2009/Discovering%20Frequent%20Closed%20Itemsets%20for%20Association%20Rules.pdf
          Hide
          Yarco Hayduk added a comment -
          Show
          Yarco Hayduk added a comment - http://www.dataminingarticles.com/closed-maximal-itemsets.html has a nice explanation too
          Hide
          jinyongbo added a comment -

          Thanks all.

          Show
          jinyongbo added a comment - Thanks all.
          Yarco Hayduk made changes -
          Link This issue relates to MAHOUT-629 [ MAHOUT-629 ]
          Hide
          Yarco Hayduk added a comment -

          After applying this patch against the latest sources, I get a significant improvement in terms of execution time, as well as pattern output (I get much less redundant patterns).

          Show
          Yarco Hayduk added a comment - After applying this patch against the latest sources, I get a significant improvement in terms of execution time, as well as pattern output (I get much less redundant patterns).
          Hide
          Yarco Hayduk added a comment -

          OK. I found an issue - you need to remove the topdown mining completely to avoid generating redundant patterns. Now I get a 56kb pattern file instead of 1499kb one

          The pattern mining on any level needs to be done in a bottom-up fashion only. The current implementation is mining the patterns in a top-down fashion only at a certain level. Will submit a patch once I am done. Further, will try to refactor the code into logical peaces (FP-Bonsai pruning, counting etc etc).

          Show
          Yarco Hayduk added a comment - OK. I found an issue - you need to remove the topdown mining completely to avoid generating redundant patterns. Now I get a 56kb pattern file instead of 1499kb one The pattern mining on any level needs to be done in a bottom-up fashion only. The current implementation is mining the patterns in a top-down fashion only at a certain level. Will submit a patch once I am done. Further, will try to refactor the code into logical peaces (FP-Bonsai pruning, counting etc etc).
          Hide
          Robin Anil added a comment -

          That step was added as an optimization to fetch topK patterns faster. And its order of magnitude faster of some dataset. If you are refactoring, can you disable it behind a flag say a strict mode

          Show
          Robin Anil added a comment - That step was added as an optimization to fetch topK patterns faster. And its order of magnitude faster of some dataset. If you are refactoring, can you disable it behind a flag say a strict mode
          Hide
          Yarco Hayduk added a comment -

          Thank you for the suggestion. Frankly, I have not ran into a definition of "not strictly closed frequent item sets". The patterns are either closed, free or maximal (there are more definitions, these are only the most common ones). Hence, the implementation is flawed if its javadoc states that it outputs closed itemsets, and in reality it does not.
          I will verify that the decision to traverse the tree top down (at a certain level) leads to better performance and will investigate how to eliminate the issue of redundant itemsets in that scenario.

          Can you please elaborate on this line of code, as I am having a hard time understanding it:
          "minSupportValue = Math.max(minSupportValue,minSupport.longValue() / 2);"
          Why would we ever want to change the minsup? I assume that this idea comes from the http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.9324&rep=rep1&type=pdf paper?

          Show
          Yarco Hayduk added a comment - Thank you for the suggestion. Frankly, I have not ran into a definition of "not strictly closed frequent item sets". The patterns are either closed, free or maximal (there are more definitions, these are only the most common ones). Hence, the implementation is flawed if its javadoc states that it outputs closed itemsets, and in reality it does not. I will verify that the decision to traverse the tree top down (at a certain level) leads to better performance and will investigate how to eliminate the issue of redundant itemsets in that scenario. Can you please elaborate on this line of code, as I am having a hard time understanding it: "minSupportValue = Math.max(minSupportValue,minSupport.longValue() / 2);" Why would we ever want to change the minsup? I assume that this idea comes from the http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.9324&rep=rep1&type=pdf paper?
          Hide
          niu added a comment -

          where is the source code of these patch?

          Show
          niu added a comment - where is the source code of these patch?
          Hide
          Yarco Hayduk added a comment -

          can't give you the source yet....I'm still in the process of perf. testing.
          I found an important bug recently. In the mapReduce version of the program, we don't need to encode the transactions the second time, as it distorts the results. The tree gets restructured and our conditionals get moved almost to the root, instead of being in the bottom of the tree. Anyone care to try the MapReduce version with the num of groups == 1 ?) You will likely see this problem too

          Show
          Yarco Hayduk added a comment - can't give you the source yet....I'm still in the process of perf. testing. I found an important bug recently. In the mapReduce version of the program, we don't need to encode the transactions the second time, as it distorts the results. The tree gets restructured and our conditionals get moved almost to the root, instead of being in the bottom of the tree. Anyone care to try the MapReduce version with the num of groups == 1 ?) You will likely see this problem too
          Hide
          Sean Owen added a comment -

          Removing from 0.6 unless a patch is available.

          Show
          Sean Owen added a comment - Removing from 0.6 unless a patch is available.
          Sean Owen made changes -
          Fix Version/s 0.6 [ 12316364 ]
          Hide
          Sean Owen added a comment -

          We can reopen if a new patch is ever available.

          Show
          Sean Owen added a comment - We can reopen if a new patch is ever available.
          Sean Owen made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Won't Fix [ 2 ]
          Suneel Marthi made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Robin Anil
              Reporter:
              Yarco Hayduk
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development