Details

      Description

      I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation. This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution. I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.

      FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines. If run in sequential mode, or with "-g 50" it will take considerable time. Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time. If you experiment with this data, I suggest using '-s 50 -regex "[ ]+"'.

      Digging into this, I've found that the tree pruning code sometimes creates surprising trees. One oddity I've observed is 0-count nodes, sometimes with non-zero children. The other is that sometimes subtrees seem to get repeated. I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

      1. addSynth.patch
        588 kB
        tom pierce
      2. logtrees.patch
        1 kB
        tom pierce
      3. MAHOUT-890.patch
        91 kB
        tom pierce
      4. MAHOUT-890-2.patch
        103 kB
        tom pierce
      5. MAHOUT-890-3.patch
        713 kB
        tom pierce
      6. simpleFPG.patch
        38 kB
        tom pierce
      7. smallexample.dat
        0.1 kB
        tom pierce

        Issue Links

          Activity

          tom pierce created issue -
          tom pierce made changes -
          Field Original Value New Value
          Attachment addSynth.patch [ 12504410 ]
          Attachment smallexample.dat [ 12504411 ]
          Attachment logtrees.patch [ 12504412 ]
          tom pierce made changes -
          Attachment simpleFPG.patch [ 12504816 ]
          tom pierce made changes -
          Attachment MAHOUT-890.patch [ 12506009 ]
          tom pierce made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to MAHOUT-920 [ MAHOUT-920 ]
          tom pierce made changes -
          Attachment MAHOUT-890-2.patch [ 12508956 ]
          tom pierce made changes -
          Attachment MAHOUT-890-3.patch [ 12508970 ]
          Robin Anil made changes -
          Assignee Robin Anil [ robinanil ]
          Robin Anil made changes -
          Fix Version/s 0.6 [ 12316364 ]
          Robin Anil made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Sean Owen made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Robin Anil
              Reporter:
              tom pierce
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development