Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels: None

      Description

      I've encountered a dataset that suggests there is a performance bug lurking in the FPGrowth implementation. This set may be a somewhat unusual target for FPG: it has a relatively modest number of itemsets and many items with a Zipf-like distribution. I am attaching a patch (addSynth.patch) that adds a similar dataset as core/src/test/resources/FPGsynth.dat.

      FPGsynth.dat can take anywhere from minutes to a few hours to process, depending on how the items are grouped across machines. In sequential mode, or with "-g 50", it takes considerable time. Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time. If you experiment with this data, I suggest using '-s 50 -regex "[ ]+"'.

      Digging into this, I've found that the tree pruning code sometimes creates surprising trees. One oddity I've observed is 0-count nodes, sometimes with non-zero children. The other is that subtrees sometimes appear to be repeated. I'm attaching a sample input file (smallexample.dat; use the whitespace regex with this one, too) and a patch (logtrees.patch) that adds logging to pruneFPTree and growthBottomUp, which prints some of these trees when run with the smallexample.dat input.
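
To illustrate the two oddities described above, here is a minimal sketch of a consistency check. It assumes a simplified, pointer-based tree and the usual FP-tree invariants (a node's count is at least the sum of its children's counts, and siblings carry distinct items). The Node class and the findAnomalies helper are hypothetical and are not part of Mahout's FPTree, which uses a different internal representation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FpTreeCheck {
    /** Hypothetical minimal FP-tree node; NOT Mahout's FPTree. */
    static final class Node {
        final int item;
        final long count;
        final List<Node> children = new ArrayList<>();

        Node(int item, long count) {
            this.item = item;
            this.count = count;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Returns descriptions of invariant violations found anywhere below root. */
    static List<String> findAnomalies(Node root) {
        List<String> problems = new ArrayList<>();
        walk(root, true, "root", problems);
        return problems;
    }

    private static void walk(Node node, boolean isRoot, String path, List<String> problems) {
        long childSum = 0;
        Set<Integer> seen = new HashSet<>();
        for (Node child : node.children) {
            childSum += child.count;
            // Two siblings with the same item mean a subtree was repeated.
            if (!seen.add(child.item)) {
                problems.add("duplicate child item " + child.item + " under " + path);
            }
        }
        // The root carries no meaningful count, so it is exempt from this check.
        // A 0-count node with non-zero children is caught here.
        if (!isRoot && node.count < childSum) {
            problems.add("count " + node.count + " < child sum " + childSum + " at " + path);
        }
        for (Node child : node.children) {
            walk(child, false, path + "/" + child.item, problems);
        }
    }

    public static void main(String[] args) {
        // Item 1 appears twice under the root; the first copy has count 0
        // but a child with count 3.
        Node root = new Node(-1, 0)
            .add(new Node(1, 0).add(new Node(2, 3)))
            .add(new Node(1, 2));
        for (String problem : findAnomalies(root)) {
            System.out.println(problem);
        }
    }
}
```

A check along these lines could be run on a tree before and after pruning to narrow down where the inconsistent nodes are introduced.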

        Attachments

        1. addSynth.patch
          588 kB
          tom pierce
        2. logtrees.patch
          1 kB
          tom pierce
        3. MAHOUT-890.patch
          91 kB
          tom pierce
        4. MAHOUT-890-2.patch
          103 kB
          tom pierce
        5. MAHOUT-890-3.patch
          713 kB
          tom pierce
        6. simpleFPG.patch
          38 kB
          tom pierce
        7. smallexample.dat
          0.1 kB
          tom pierce

              People

              • Assignee:
                robinanil Robin Anil
                Reporter:
                tcp tom pierce
              • Votes:
                0
                Watchers:
                3