Spark / SPARK-1473

Feature selection for high dimensional datasets


Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib

    Description

      For classification tasks involving large feature spaces on the order of tens of thousands of features or more (e.g., text classification with n-grams, where n > 1), it is often useful to rank and filter out irrelevant features, thereby reducing the feature space by at least one or two orders of magnitude without hurting key evaluation metrics (accuracy/precision/recall).

      A flexible feature evaluation interface needs to be designed, and at least two methods should be implemented, with Information Gain as a priority since it has been shown to be among the most reliable.

      Special consideration should be given in the design to wrapper methods (see the research papers below), which are more practical for lower-dimensional data.

      Relevant research:

      • Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13, 27-66.
      • Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289-1305.
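To make the proposed filter approach concrete, here is a minimal sketch of ranking discrete features by Information Gain, IG(Y; X) = H(Y) - H(Y | X). This is plain Python for illustration only, not an MLlib API; all function names are hypothetical:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column."""
    n = len(labels)
    h_y_given_x = 0.0
    for value in set(feature):
        # Labels of the examples where the feature takes this value.
        subset = [y for x, y in zip(feature, labels) if x == value]
        h_y_given_x += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_y_given_x

def rank_features(X, y, top_k):
    """Score every column of X (a list of rows) by information gain
    against labels y, and return the top_k (index, score) pairs."""
    n_features = len(X[0])
    scores = [(j, information_gain([row[j] for row in X], y))
              for j in range(n_features)]
    return sorted(scores, key=lambda s: -s[1])[:top_k]
```

For example, a feature column that perfectly predicts a binary label scores IG = 1.0 bit, while an independent feature scores 0.0, so `rank_features` would keep the former and filter out the latter. An interface along these lines would let Information Gain be swapped for other filter metrics (e.g., chi-squared) behind the same ranking step.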


      People

      • Assignee: Unassigned
      • Reporter: Ignacio Zendejas (izendejas)
      • Votes: 7
      • Watchers: 18
