  1. Flink
  2. FLINK-1736

Add CountVectorizer to machine learning library


Details

    Description

      A CountVectorizer feature extractor [1] assigns a unique identifier to each word that occurs in a corpus. With this mapping it can efficiently vectorize text representations such as bag of words or n-grams. The identifier assigned to a word serves as that word's index in the feature vector, and the value stored at that index is the number of times the word occurs.
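The mapping described above can be sketched in a few lines of plain Python. This is only an illustration, not the proposed Flink implementation: the class name `SimpleCountVectorizer`, its `fit`/`transform` methods, whitespace tokenization, and the first-seen ordering of indices are all assumptions made for the example.

```python
from collections import Counter


class SimpleCountVectorizer:
    """Hypothetical sketch of a count vectorizer; not part of any Flink API."""

    def __init__(self):
        # word -> index in the feature vector
        self.vocabulary = {}

    def fit(self, documents):
        # Assign each previously unseen word the next free index.
        for doc in documents:
            for word in doc.split():
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary)
        return self

    def transform(self, documents):
        # Build one dense count vector per document.
        vectors = []
        for doc in documents:
            vec = [0] * len(self.vocabulary)
            for word, count in Counter(doc.split()).items():
                idx = self.vocabulary.get(word)
                if idx is not None:  # words not seen during fit are ignored
                    vec[idx] = count
            vectors.append(vec)
        return vectors


corpus = ["the cat sat", "the cat saw the dog"]
vectorizer = SimpleCountVectorizer().fit(corpus)
vectors = vectorizer.transform(corpus)
```

Here `fit` learns the vocabulary and `transform` produces the count vectors. A real implementation would likely use sparse vectors, since most entries are zero for a large vocabulary.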

      The advantage of the CountVectorizer over the FeatureHasher is that the mapping from words to indices can be retrieved, which makes the resulting feature vectors easier to interpret.
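To make the difference concrete, the following sketch contrasts the two approaches. It is a simplified illustration under stated assumptions: `hash_vectorize` and `count_vectorize` are hypothetical helper names, CRC32 stands in for whatever hash function a real feature hasher would use, and the vocabulary is written out by hand.

```python
import zlib


def hash_vectorize(doc, num_features):
    # Hashing trick: the index comes from a hash of the word, so no
    # vocabulary is stored and distinct words may share an index.
    vec = [0] * num_features
    for word in doc.split():
        vec[zlib.crc32(word.encode("utf-8")) % num_features] += 1
    return vec


def count_vectorize(doc, vocabulary):
    # Vocabulary-based counting: the index comes from an explicit mapping.
    vec = [0] * len(vocabulary)
    for word in doc.split():
        if word in vocabulary:
            vec[vocabulary[word]] += 1
    return vec


doc = "the cat saw the dog"
vocabulary = {"the": 0, "cat": 1, "saw": 2, "dog": 3}  # hand-written for the example

hashed = hash_vectorize(doc, num_features=8)
counted = count_vectorize(doc, vocabulary)

# Only the vocabulary-based variant can be inverted to label each dimension.
index_to_word = {index: word for word, index in vocabulary.items()}
labelled = {index_to_word[i]: c for i, c in enumerate(counted)}
```

With the explicit vocabulary, every vector position can be traced back to a word (`labelled`). The hashed vector has no such inverse without keeping extra bookkeeping, and colliding words cannot be told apart.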

      The CountVectorizer could be generalized to support arbitrary feature values.

      The CountVectorizer should be implemented as a Transformer.

      Resources:
      [1] http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage

          People

            Roshani19 ROSHANI NAGMOTE
            trohrmann Till Rohrmann
            Votes: 0
            Watchers: 3
