Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-658

UH-Mine algorithm for frequent pattern mining of uncertain data

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Closed
    • Major
    • Resolution: Later
    • 0.4, 0.5
    • None
    • None

    Description

      Proposal Title: UH-Mine algorithm for frequent pattern mining of uncertain data

      Student Name: Yaroslav Hayduk
      Student E-mail: yarcoh@gmail.com
      Organization/Project: Apache Mahout
      Assigned Mentor:

      Abstract
      Frequent pattern mining detects frequent patterns in data, which in turn permits data analysts determine interesting correlations in the dataset. In uncertain data mining, we suspect, but cannot guarantee, the presence or absence of an item. Currently, Apache
      Mahout does not contain algorithm implementations capable of mining frequent patterns from uncertain data. Hence, I propose to implement UH-Mine during GSoC 2011, used for frequent pattern mining of uncertain data.

      Motivation
      There are many real-life situations, where we observe uncertain data, such as in temperature and wind speed readings, patient diagnosis, and satellite imaging. Due to the probabilistic nature of this data, it takes more time and resources to mine it. Current state of-the-art frequent pattern mining of uncertain data algorithms do not provide sufficient performance results, as most of them are not crafted to execute in parallel.
      In uncertain data mining, we suspect, but cannot guarantee, the presence or absence of an item. For example, when a doctor is 90% sure that a patient suffers from a particular disease. A video lecture by Jian Pei [3] contains a thorough overview of the uncertain data problem domain as well as elaborates on the process of frequent pattern mining of uncertain data.

      Comparison of UF-Growth and UH-Mine
      The UF-Growth [4] algorithm modifies the FP-Growth [2] algorithm in the way the transaction tree is built. FP-Growth uses the FP-Tree, a tree-based data structure, to store a compact representation of the transaction database, which contains frequency information of all frequent items. The FP-Tree, however, does not store existential probabilities, associated with items. Hence, Leung et al. [4] created a data structure, called UF-Tree, which is based on the FP-tree. Each node in the UF-Tree stores an item, its expected support, as well as the number of occurrence of such expected support for each item. To merge the transaction with the child node in UF-Tree, UF-Growth requires both the item and its corresponding existential probability to match. Hence, the algorithm arrives with the UF-Tree, having a much lower compression ratio, that in the FP-Tree, constructed by FP-Growth.
      As such, I propose to adapt and implement the UH-Mine algorithms to the MapReduce programming model. The UH-Struct structure uses the linkage behaviour among transactions corresponding to a branch of the FP-Tree(UF-Tree) without actually creating a projected database.
      This approach is better than FP-Tree even in the deterministic case, when compression from FP-Tree is not high. This turns out to be particularly true for the uncertain case, as discussed earlier. The UH-mine [1] algorithm provides the best trade-offs both in terms of running time and memory usage.

      The UH-Mine algorithm works as follows: it
      1. prunes the initial DB such that all singleton infrequent items are removed;
      2. divides the pruned DB into equal chunks;
      3. mines these chunks separately using the UH-Mine(mem) algorithm. To mine frequent patterns, UH-Mine(mem) maintains an H-Struct, which contains pointers to transaction items. At each step, the algorithm adjusts these pointers and does not incur the overheads, associated with the FP(UF)-Tree construction;
      4. joins the results and
      5. scans the pruned DB once again to remove false positives and obtain the actual counts.

      Benefits for the Mahout community
      a) My work adds an algorithm implementation for discovering frequent patterns in uncertain data
      b) If my algorithm implementation proves to be fast, it would permit future algorithm developers to retrofit my UH-Mine implementation to create H-Mine[5]. H-Mine is an alternative algorithm, which mines frequent patterns from precise data.

      Timeline
      Weeks 1-4: Implement UH-Mine in Java, set up a local Hadoop cluster composed of 3 machines Weeks 5-7: Investigate Mahout code structure, start Adopting UH-Mine to MapReduce.
      Weeks 8: Summit for mid-term evaluation
      Weeks 9 - 11: Finish-up with my implementation, refactor code smells, identify and fix performance issues
      Weeks 11 - 12: Code cleaning, documenting and testing.

      Biography
      My name is Yarco Hayduk and I am a Comp Sci graduate student at the University of Manitoba, Canada. I'm very interested in concurrency and data mining. Currently, I am working on a similar project, which adopts the UFP-Growth[6] algorithm to MapReduce. I'm using the Parallel FP-Growth algorithm as a starting point for implementing UFP-Growth. I also implemented UH-Mine and H-Mine algorithms in C. Thus, I would use it as a reference for my proposed UH-Mine implementation on MapReduce.

      References
      [1] Jianyong Wang Charu C. Aggarwal, Yan Li and Jing Wang. Frequent pattern mining
      with uncertain data. In KDD '09: 15th ACM SIGKDD International Conference on
      Knowledge Discovery and Data Mining, pages 29--38, Paris, France, June 2010. ACM.
      10.1145/1557019.1557030.
      [2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In SIGMOD '00: 2000 ACM SIGMOD international conference on management
      of data, pages 1--12, Dallas, TX, USA, May 2000. ACM. 10.1145/342009.335372.
      [3] Ming Hua Jian Pei. Mining uncertain and probabilistic data: problems, challenges,
      methods, and applications. http://videolectures.net/kdd08_pei_mupd, Accessed on
      March 11, 2011.
      [4] Carson Kai-Sang Leung, Christopher Lee Carmichael, and Boyu Hao. Efficient mining
      of frequent patterns from uncertain data. In ICDMW '07: 7th IEEE International Conference on Data Mining Workshops, pages 489--494, Omaha, NE, USA, October 2007.
      IEEE. 10.1109/ICDMW.2007.84.
      [5] Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang, and Dongqing Yang.
      H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM '01:
      1st IEEE International Conference on Data Mining, pages 441--448, San Jose, California,
      USA, November 2001. IEEE. 10.1109/ICDM.2001.989550.
      [6] Calin Garboni Toon Calders and Bart Goethals. Efficient pattern mining from uncertain data with sampling. In PAKDD 2010: 14th Pacific-Asia Conference on Knowledge
      Discovery and Data Mining, pages 480--487, Hyderabad, India, June 2010. Springer.
      10.1007/978-3-642-13657-351.

      Attachments

        Activity

          People

            Unassigned Unassigned
            yarco Yarco Hayduk
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: