Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-890

Performance issue in FPGrowth

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.6
    • 0.6
    • None
    • None

    Description

      I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation. This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution. I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.

      FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines. If run in sequential mode, or with "-g 50" it will take considerable time. Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time. If you experiment with this data, I suggest using '-s 50 -regex "[ ]+"'.

      Digging into this, I've found that the tree pruning code sometimes creates surprising trees. One oddity I've observed is 0-count nodes, sometimes with non-zero children. The other is that sometimes subtrees seem to get repeated. I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

      Attachments

        1. smallexample.dat
          0.1 kB
          Tom Pierce
        2. simpleFPG.patch
          38 kB
          Tom Pierce
        3. MAHOUT-890-3.patch
          713 kB
          Tom Pierce
        4. MAHOUT-890-2.patch
          103 kB
          Tom Pierce
        5. MAHOUT-890.patch
          91 kB
          Tom Pierce
        6. logtrees.patch
          1 kB
          Tom Pierce
        7. addSynth.patch
          588 kB
          Tom Pierce

        Issue Links

          Activity

            People

              robinanil Robin Anil
              tcp Tom Pierce
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: