Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-629

FP Growth performance improvement

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Invalid
    • 0.4, 0.5
    • None
    • None
    • None

    Description

      Instead of calculating patterns ending with given attribute multiple times they can be calculated just once. Depending on for how many features patterns are generated, speedup can be huge. More feature included - greater speedup. For test data set (88162 real life 'basket' transactions), if all features were selected (i.e. we want to generate patterns for all items in transactions), patterns generation time dropped from 1h 15min to 8sec. For parallel fpgrowth, where the number of requested features is limited the speedup is not that dramatic, but still noticeable. Basically work done is always smaller than before the patch (as patterns for each item are calculated at most once).

      The improvement is a variation of base algorithm in situation where we want to generate patterns for only subset of items (let's call this subset A). Given that items are ordered by descending frequency it is enough to calculate only patterns ending on any item with frequency smaller or equal to the most frequent item in the subset A. The heaps for each item are initialized upfront and merged after processing every item.

      Attachments

        1. performance_improvement.txt
          11 kB
          Jaroslaw Odzga
        2. FPGrowth.java
          34 kB
          niu

        Issue Links

          Activity

            People

              robinanil Robin Anil
              jarek Jaroslaw Odzga
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: