Details
Type: New Feature
Status: Closed
Priority: Major
Resolution: Won't Do
Description
A CountVectorizer feature extractor [1] assigns a unique identifier to each word that occurs in a corpus. With this mapping it can efficiently vectorize text under models such as bag of words or n-grams. The identifier assigned to a word serves as that word's index in the feature vector, and the vector value at that index is the number of times the word occurs.
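The mapping and the resulting count vectors can be illustrated with a minimal sketch in plain Python. This is not the proposed implementation: the function names `fit_vocabulary` and `transform` are hypothetical, tokens are split on whitespace, and indices are assigned in order of first appearance (scikit-learn, for instance, orders its vocabulary differently).

```python
from collections import Counter


def fit_vocabulary(documents):
    """Assign each distinct token an index, in order of first appearance."""
    vocabulary = {}
    for doc in documents:
        for token in doc.split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def transform(documents, vocabulary):
    """Return one count vector per document; unseen tokens are ignored."""
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        vector = [0] * len(vocabulary)
        for token, n in counts.items():
            index = vocabulary.get(token)
            if index is not None:
                vector[index] = n
        vectors.append(vector)
    return vectors


corpus = ["the cat sat", "the cat saw the dog"]
vocabulary = fit_vocabulary(corpus)
vectors = transform(corpus, vocabulary)
print(vocabulary)
print(vectors)
```

Each vector has one entry per vocabulary word, so the second document's count for "the" sits at the index that `vocabulary["the"]` returns.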
The advantage of the CountVectorizer over the FeatureHasher is that the mapping from words to indices can be retrieved, which makes the resulting feature vectors easier to interpret.
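The difference can be sketched as follows. The hashing function here (`zlib.crc32` modulo the vector size) is only a stand-in chosen for determinism and is not claimed to be what the FeatureHasher uses; the vocabulary and the count vector are made-up illustration data.

```python
import zlib


def hashed_index(token, num_features):
    """Stand-in hashing trick: the index is derived from the token alone."""
    return zlib.crc32(token.encode("utf-8")) % num_features


# With an explicit vocabulary, the mapping can be inverted.
vocabulary = {"the": 0, "cat": 1, "sat": 2}
index_to_word = {index: word for word, index in vocabulary.items()}

vector = [2, 0, 1]  # counts at the indices of "the", "cat", "sat"
explained = {index_to_word[i]: n for i, n in enumerate(vector) if n}
print(explained)

# With hashing, no such table exists, and distinct words may share an index.
# Three words hashed into two buckets must collide at least once.
buckets = {hashed_index(word, 2) for word in vocabulary}
print(len(buckets) < len(vocabulary))
```

The inverse table `index_to_word` is what lets a nonzero vector entry be traced back to a word; a hashed index carries no such information.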
The CountVectorizer could be generalized to support arbitrary feature values.
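One way to read this generalization, as a sketch under the assumption that a "feature value" is any hashable object (the ticket does not specify this): the vocabulary is keyed by the values themselves rather than by words. The function name `fit_transform` is hypothetical.

```python
from collections import Counter


def fit_transform(samples):
    """Index arbitrary hashable feature values and count them per sample."""
    vocabulary = {}
    for sample in samples:
        for feature in sample:
            # setdefault keeps the first index assigned to each value.
            vocabulary.setdefault(feature, len(vocabulary))
    vectors = []
    for sample in samples:
        counts = Counter(sample)
        # Dicts preserve insertion order, so this matches the index order.
        vectors.append([counts[feature] for feature in vocabulary])
    return vocabulary, vectors


# Features may be tuples (for example bigrams), numbers, or strings.
samples = [[("new", "york"), 42, "red"], ["red", "red", 42]]
vocabulary, vectors = fit_transform(samples)
print(vocabulary)
print(vectors)
```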
The CountVectorizer should be implemented as a Transformer.
Resources:
[1] http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage