[FLINK-1736] Add CountVectorizer to machine learning library - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Won't Do
Affects Version/s: None
Fix Version/s: None
Component/s: Library / Machine Learning
Labels:
- ML
- Starter

Description

A CountVectorizer feature extractor [1] assigns each word occurring in a corpus a unique identifier. With this mapping, it can vectorize models such as bag of words or n-grams efficiently. The identifier assigned to a word serves as its index in the feature vector, and the value at that index is the number of times the word occurs.
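
The mapping described above can be sketched in a few lines of Python. This is only an illustration, not the proposed Flink implementation: the function names `build_vocabulary` and `vectorize`, the whitespace tokenization, and the sample corpus are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Assign each distinct word an index, in order of first appearance."""
    vocabulary = {}
    for document in corpus:
        for word in document.split():  # assumed tokenizer: split on whitespace
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(document, vocabulary):
    """Return a dense count vector; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(document.split()).items():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] = count
    return vector


# Hypothetical two-document corpus.
corpus = ["the cat sat", "the cat saw the dog"]
vocabulary = build_vocabulary(corpus)
vectors = [vectorize(document, vocabulary) for document in corpus]
```

Here `vocabulary` maps `the`, `cat`, `sat`, `saw`, `dog` to indices 0 through 4, so the second document becomes `[2, 1, 0, 1, 1]`.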

The advantage of the CountVectorizer over the FeatureHasher is that the mapping from words to indices can be retrieved, which makes the resulting feature vectors easier to interpret.
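
To illustrate the difference, the sketch below contrasts a stored vocabulary with a hashed index. It assumes a hashing scheme based on `zlib.crc32` purely for demonstration; the actual FeatureHasher may hash differently, and `hashed_index`, the vocabulary, and the count vector are hypothetical.

```python
import zlib


def hashed_index(word, num_features):
    """Hashing trick: the index is derived from a hash, so no dictionary is kept."""
    return zlib.crc32(word.encode("utf-8")) % num_features


# With a stored vocabulary, the mapping can be inverted to explain a vector.
vocabulary = {"the": 0, "cat": 1, "dog": 2}
index_to_word = {index: word for word, index in vocabulary.items()}

count_vector = [2, 1, 1]
explained = {
    index_to_word[index]: count
    for index, count in enumerate(count_vector)
    if count > 0
}

# With hashing, only the forward direction exists; several words may share
# one bucket, so an index cannot be traced back to a single word.
bucket = hashed_index("cat", 8)
```

`explained` reads `{"the": 2, "cat": 1, "dog": 1}`, whereas `bucket` is just a number between 0 and 7 with no stored link back to `cat`.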

The CountVectorizer could be generalized to support arbitrary feature values.
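
One possible reading of this generalization is that the vocabulary is keyed by any hashable feature value rather than by words only. The sketch below follows that assumption; `fit_vocabulary`, `count_vector`, `bigrams`, and the sample data are hypothetical names chosen for the example.

```python
from collections import Counter


def bigrams(words):
    """Pair each word with its successor, giving tuples as feature values."""
    return list(zip(words, words[1:]))


def fit_vocabulary(samples):
    """Index every distinct feature value; any hashable value is accepted."""
    vocabulary = {}
    for sample in samples:
        for feature in sample:
            vocabulary.setdefault(feature, len(vocabulary))
    return vocabulary


def count_vector(sample, vocabulary):
    """Count occurrences of each known feature value in one sample."""
    vector = [0] * len(vocabulary)
    for feature, count in Counter(sample).items():
        if feature in vocabulary:
            vector[vocabulary[feature]] = count
    return vector


# Hypothetical samples: one made of word bigrams, one of categorical values.
samples = [bigrams(["new", "york", "new", "york"]), ["red", 3, "red"]]
vocabulary = fit_vocabulary(samples)
vectors = [count_vector(sample, vocabulary) for sample in samples]
```

The same counting logic then covers n-grams (as tuples) and categorical or numeric values without change.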

The CountVectorizer should be implemented as a Transformer.
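
As a rough picture of what a transformer-style design could look like, the sketch below separates learning the vocabulary (`fit`) from applying it (`transform`). It is written in Python for brevity and does not reproduce Flink's actual Transformer interface; the class name `CountVectorizerSketch`, the method names, and the sparse dictionary output are assumptions.

```python
from collections import Counter


class CountVectorizerSketch:
    """Hypothetical fit/transform vectorizer; not the Flink ML API."""

    def __init__(self):
        self.vocabulary = {}

    def fit(self, documents):
        """Learn the word-to-index mapping from the training documents."""
        for document in documents:
            for word in document.split():
                self.vocabulary.setdefault(word, len(self.vocabulary))
        return self

    def transform(self, documents):
        """Map each document to a sparse vector of {index: count}."""
        vectors = []
        for document in documents:
            counts = Counter(document.split())
            vectors.append({
                self.vocabulary[word]: count
                for word, count in counts.items()
                if word in self.vocabulary  # unseen words are dropped
            })
        return vectors


model = CountVectorizerSketch().fit(["a b", "b c"])
result = model.transform(["b b d"])
```

Keeping the learned vocabulary inside the fitted object means new documents are vectorized with the same indices as the training data; here `result` is `[{1: 2}]` because `d` was never seen during `fit`.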

Resources:
[1] http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage

People

Assignee: ROSHANI NAGMOTE

Reporter: Till Rohrmann

Votes: 0

Watchers: 3

Dates

Created: 18/Mar/15 15:07

Updated: 28/Feb/19 22:57

Resolved: 28/Feb/19 22:57