Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels:
      None

      Description

      I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation. This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution. I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.

      FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines. If run in sequential mode, or with "-g 50" it will take considerable time. Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time. If you experiment with this data, I suggest using '-s 50 -regex "[ ]+"'.

      Digging into this, I've found that the tree pruning code sometimes creates surprising trees. One oddity I've observed is 0-count nodes, sometimes with non-zero children. The other is that sometimes subtrees seem to get repeated. I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

        Attachments

        1. MAHOUT-890-3.patch
          713 kB
          tom pierce
        2. MAHOUT-890-2.patch
          103 kB
          tom pierce
        3. MAHOUT-890.patch
          91 kB
          tom pierce
        4. simpleFPG.patch
          38 kB
          tom pierce
        5. logtrees.patch
          1 kB
          tom pierce
        6. smallexample.dat
          0.1 kB
          tom pierce
        7. addSynth.patch
          588 kB
          tom pierce

          Issue Links

            Activity

              People

              • Assignee:
                robinanil Robin Anil
                Reporter:
                tcp tom pierce
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: