[SPARK-6509] MDLP discretizer - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: MLlib
Labels: None

Description

Minimum Description Length Discretizer

This method implements Fayyad's discretizer [1], based on the Minimum Description Length Principle (MDLP), in order to discretize continuous (non-discrete) datasets in a distributed setting. We have developed a distributed version of the original algorithm, with some important changes.
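
For readers unfamiliar with [1], the following is a minimal, single-machine sketch of the MDLP acceptance test that decides whether a candidate cut point is kept. It is not the distributed code from the pull request; the function names `entropy` and `mdlp_accepts` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mdlp_accepts(left, right):
    """Return True if splitting a set into `left` and `right` passes the
    MDLP criterion of Fayyad and Irani [1].

    `left` and `right` hold the class labels that fall on each side of a
    candidate cut point. (Hypothetical helper, not the pull request's API.)
    """
    if not left or not right:
        return False
    whole = list(left) + list(right)
    n = len(whole)
    ent_s, ent_l, ent_r = entropy(whole), entropy(left), entropy(right)

    # Information gain obtained by the split.
    gain = ent_s - (len(left) / n) * ent_l - (len(right) / n) * ent_r

    # Number of distinct classes in the whole set and in each side.
    k, k1, k2 = len(set(whole)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * ent_l - k2 * ent_r)

    # Accept the cut only if the gain exceeds the description-length cost.
    return gain > (math.log2(n - 1) + delta) / n
```

In the recursive scheme of [1], the cut with the highest gain is tested this way, and each side is split further only if the test passes.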

Associated paper:

Ramírez-Gallego, S., García, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., Benítez, J. M. and Herrera, F. (2016), Data discretization: taxonomy and big data challenge. WIREs Data Mining Knowledge Discovery, 6: 5–21. doi:10.1002/widm.1173
URL: http://onlinelibrary.wiley.com/doi/10.1002/widm.1173/abstract

Improvements on the discretizer:

Support for sparse data.
Multi-attribute processing. The whole process is carried out in a single step when the boundary points of each attribute fit in one partition (<= 100K boundary points per attribute).
Support for attributes with a huge number of boundary points (> 100K boundary points per attribute). This is a rare situation.
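
To make the term "boundary points" concrete, here is a small sketch of how candidate cut points could be computed for one attribute, together with the 100K dispatch described above. It follows the definition in [1] (a midpoint between adjacent distinct values is a candidate unless both values belong to the same single class). The names `boundary_points`, `strategy`, and `MAX_BY_PART` are assumptions for illustration, not identifiers taken from the pull request.

```python
from itertools import groupby

# Assumed name for the 100K threshold quoted in the list above.
MAX_BY_PART = 100_000


def boundary_points(values, labels):
    """Return midpoints between adjacent distinct values where the class
    can change, i.e. the candidate cut points for one attribute."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    # For each distinct value, collect the set of classes observed with it.
    groups = [
        (value, {label for _, label in group})
        for value, group in groupby(pairs, key=lambda p: p[0])
    ]
    points = []
    for (v1, c1), (v2, c2) in zip(groups, groups[1:]):
        # Skip the midpoint only when both neighbours hold the same single class.
        if len(c1 | c2) > 1:
            points.append((v1 + v2) / 2)
    return points


def strategy(n_points, limit=MAX_BY_PART):
    """Pick a processing mode from the number of boundary points.
    The two labels are hypothetical names for the two cases listed above."""
    return "single-step" if n_points <= limit else "large-attribute"
```

For example, `boundary_points([1, 2, 3, 4], ["a", "a", "b", "b"])` yields a single candidate at 2.5, because that is the only place where the class changes.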

This software has been tested on two large real-world datasets:

A dataset selected for the GECCO-2014 competition (Vancouver, July 13th, 2014), which comes from the Protein Structure Prediction field (http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances, 631 attributes, and 2 classes, with 98% negative examples, and occupies about 56 GB of disk space when uncompressed.
Epsilon dataset: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon. It has 400K instances and 2K attributes.

We have demonstrated that our method runs 300 times faster than the sequential version on the first dataset, and that it also improves accuracy for Naive Bayes.

Publication: S. Ramírez-Gallego, S. García, H. Mouriño-Talin, D. Martínez-Rego, V. Bolón, A. Alonso-Betanzos, J.M. Benitez, F. Herrera. "Data Discretization: Taxonomy and Big Data Challenge", WIRES Data Mining and Knowledge Discovery. In press, 2015.

Design doc: https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing

References

[1] Fayyad, U., & Irani, K. (1993). "Multi-interval discretization of continuous-valued attributes for classification learning."

Issue Links

Is contained by

SPARK-1303 Added discretization capability to MLlib.

Resolved

links to

[Github] Pull Request #5170 (sramirez)

People

Assignee: Unassigned
Reporter: Sergio Ramírez
Votes: 0
Watchers: 4

Dates

Created: 24/Mar/15 18:19
Updated: 18/Apr/17 14:37
Resolved: 23/Nov/15 09:32