Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-629

FP Growth performance improvement

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 0.4, 0.5
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Instead of calculating patterns ending with given attribute multiple times they can be calculated just once. Depending on for how many features patterns are generated, speedup can be huge. More feature included - greater speedup. For test data set (88162 real life 'basket' transactions), if all features were selected (i.e. we want to generate patterns for all items in transactions), patterns generation time dropped from 1h 15min to 8sec. For parallel fpgrowth, where the number of requested features is limited the speedup is not that dramatic, but still noticeable. Basically work done is always smaller than before the patch (as patterns for each item are calculated at most once).

      The improvement is a variation of base algorithm in situation where we want to generate patterns for only subset of items (let's call this subset A). Given that items are ordered by descending frequency it is enough to calculate only patterns ending on any item with frequency smaller or equal to the most frequent item in the subset A. The heaps for each item are initialized upfront and merged after processing every item.

        Attachments

        1. FPGrowth.java
          34 kB
          niu
        2. performance_improvement.txt
          11 kB
          Jaroslaw Odzga

          Issue Links

            Activity

              People

              • Assignee:
                robinanil Robin Anil
                Reporter:
                jarek Jaroslaw Odzga
              • Votes:
                1 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: