Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels:
      None

      Description

      I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation. This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution. I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.

      FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines. If run in sequential mode, or with "-g 50" it will take considerable time. Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time. If you experiment with this data, I suggest using '-s 50 -regex "[ ]+"'.

      Digging into this, I've found that the tree pruning code sometimes creates surprising trees. One oddity I've observed is 0-count nodes, sometimes with non-zero children. The other is that sometimes subtrees seem to get repeated. I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

      1. smallexample.dat
        0.1 kB
        tom pierce
      2. simpleFPG.patch
        38 kB
        tom pierce
      3. MAHOUT-890-3.patch
        713 kB
        tom pierce
      4. MAHOUT-890-2.patch
        103 kB
        tom pierce
      5. MAHOUT-890.patch
        91 kB
        tom pierce
      6. logtrees.patch
        1 kB
        tom pierce
      7. addSynth.patch
        588 kB
        tom pierce

        Issue Links

          Activity

          Suneel Marthi made changes -
          Component/s Frequent Itemset/Association Rule Mining [ 12313060 ]
          Sean Owen made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Robin Anil made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Robin Anil made changes -
          Fix Version/s 0.6 [ 12316364 ]
          Robin Anil made changes -
          Assignee Robin Anil [ robinanil ]
          tom pierce made changes -
          Attachment MAHOUT-890-3.patch [ 12508970 ]
          tom pierce made changes -
          Attachment MAHOUT-890-2.patch [ 12508956 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to MAHOUT-920 [ MAHOUT-920 ]
          tom pierce made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          tom pierce made changes -
          Attachment MAHOUT-890.patch [ 12506009 ]
          tom pierce made changes -
          Attachment simpleFPG.patch [ 12504816 ]
          tom pierce made changes -
          Field Original Value New Value
          Attachment addSynth.patch [ 12504410 ]
          Attachment smallexample.dat [ 12504411 ]
          Attachment logtrees.patch [ 12504412 ]
          tom pierce created issue -

            People

            • Assignee:
              Robin Anil
              Reporter:
              tom pierce
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development