Flink / FLINK-1736

Add CountVectorizer to machine learning library

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:

      Description

      A CountVectorizer feature extractor [1] assigns a unique identifier to each word that occurs in a corpus. With this mapping, it can vectorize models such as bag of words or n-grams efficiently. The identifier assigned to a word serves as that word's index in the feature vector, and the value at that index is the number of times the word occurs.
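
      The mapping described above can be illustrated with a minimal sketch in plain Python. It is not Flink code; the function names `build_vocabulary` and `vectorize` are hypothetical, and whitespace tokenization and dense output vectors are simplifying assumptions.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct word a unique index, in order of first occurrence."""
    vocabulary = {}
    for document in documents:
        for word in document.split():  # assumption: whitespace tokenization
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(document, vocabulary):
    """Return a dense count vector; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(document.split()).items():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] = count
    return vector


corpus = ["the cat sat", "the cat saw the dog"]
vocabulary = build_vocabulary(corpus)
vectors = [vectorize(document, vocabulary) for document in corpus]
# vocabulary maps the -> 0, cat -> 1, sat -> 2, saw -> 3, dog -> 4
# vectors is [[1, 1, 1, 0, 0], [2, 1, 0, 1, 1]]
```

      A real implementation would more likely emit sparse vectors, since most entries are zero for a large vocabulary.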

      The advantage of the CountVectorizer over the FeatureHasher is that the mapping from words to indices can be retrieved, which makes the resulting feature vectors easier to interpret.
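
      To make the difference concrete, here is a small Python sketch, not Flink code. The bucket count, the use of CRC32 as the hash, and the helper names `hashed_index` and `explain` are all assumptions chosen for illustration. A hashed index cannot in general be traced back to a word, and distinct words may collide, whereas an explicit vocabulary can simply be inverted.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical size of the hashed feature space


def hashed_index(word):
    """Feature hashing: the index is derived from the word and is not invertible."""
    return zlib.crc32(word.encode("utf-8")) % NUM_BUCKETS


# An explicit vocabulary, as a CountVectorizer would build it.
vocabulary = {"the": 0, "cat": 1, "dog": 2}
index_to_word = {index: word for word, index in vocabulary.items()}


def explain(vector, index_to_word):
    """Translate a count vector back into readable word counts."""
    return {index_to_word[i]: count for i, count in enumerate(vector) if count > 0}


counts = explain([2, 0, 1], index_to_word)
# counts is {"the": 2, "dog": 1}
```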

      The CountVectorizer could be generalized to support arbitrary feature values.

      The CountVectorizer should be implemented as a Transformer.
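
      One way to picture a transformer-style design is a fit step that learns the vocabulary and a transform step that applies it. The Python sketch below follows that split under stated assumptions: the class name `CountVectorizerSketch`, whitespace tokenization, and the sparse dictionary output are hypothetical choices, and the code does not use or mirror the Flink API.

```python
from collections import Counter


class CountVectorizerSketch:
    """Hypothetical fit/transform sketch; not the Flink Transformer interface."""

    def __init__(self):
        self.vocabulary = {}

    def fit(self, documents):
        """Learn the word-to-index mapping from the training corpus."""
        for document in documents:
            for word in document.split():
                # setdefault leaves the index of an already known word untouched.
                self.vocabulary.setdefault(word, len(self.vocabulary))
        return self

    def transform(self, documents):
        """Map each document to a sparse {index: count} vector.

        Words that were not seen during fit are dropped.
        """
        vectors = []
        for document in documents:
            counts = Counter(document.split())
            vectors.append({
                self.vocabulary[word]: count
                for word, count in counts.items()
                if word in self.vocabulary
            })
        return vectors


model = CountVectorizerSketch().fit(["a b a"])
sparse_vectors = model.transform(["b b c a"])
# sparse_vectors is [{1: 2, 0: 1}]; "c" was not seen during fit and is dropped
```

      Keeping the learned vocabulary on the fitted object is what lets the same mapping be reused on new data and inspected afterwards.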

      Resources:
      [1] http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage

            People

            • Assignee: ROSHANI NAGMOTE (Roshani19)
            • Reporter: Till Rohrmann (till.rohrmann)
            • Votes: 0
            • Watchers: 4