Flink / FLINK-1736

Add CountVectorizer to machine learning library

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:

      Description

      A CountVectorizer feature extractor [1] assigns a unique identifier to each word occurring in a corpus. With this mapping it can efficiently vectorize models such as bag of words or n-grams. The identifier assigned to a word serves as its index in the feature vector, and the value at that index is the number of times the word occurs.
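
      To make the word-to-index idea concrete, here is a minimal sketch in plain Python. It is not the Flink ML implementation; the function names `fit_vocabulary` and `transform` and the whitespace tokenization are assumptions chosen for illustration.

```python
from collections import Counter

def fit_vocabulary(corpus):
    """Assign each distinct word an index, in order of first appearance."""
    vocabulary = {}
    for document in corpus:
        for word in document.split():  # assumption: whitespace tokenization
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary

def transform(document, vocabulary):
    """Return a dense count vector; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(document.split()).items():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] = count
    return vector

corpus = ["the cat sat", "the cat saw the dog"]
vocabulary = fit_vocabulary(corpus)
vectors = [transform(document, vocabulary) for document in corpus]
```

      A real implementation would likely emit sparse vectors, since most entries are zero for a large vocabulary.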

      The advantage of the CountVectorizer over the FeatureHasher is that the mapping from words to indices can be retrieved, which makes the resulting feature vectors easier to interpret.
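
      The difference can be sketched as follows. The `hashed_index` helper is a hypothetical stand-in for a feature hasher (CRC32 modulo a bucket count), not the FeatureHasher's actual hash function, and the small vocabulary is made up for the example.

```python
import zlib

def hashed_index(word, num_buckets):
    """Stand-in for a feature hasher: deterministic hash modulo bucket count."""
    return zlib.crc32(word.encode("utf-8")) % num_buckets

# With an explicit vocabulary, the mapping can be inverted.
vocabulary = {"the": 0, "cat": 1, "sat": 2}
index_to_word = {index: word for word, index in vocabulary.items()}

count_vector = [2, 1, 0]
explained = {index_to_word[i]: c for i, c in enumerate(count_vector) if c > 0}

# With hashing, only bucket indices remain; distinct words may share a
# bucket, and no table exists to map a bucket back to a word.
num_buckets = 4
hashed_vector = [0] * num_buckets
for word in ["the", "the", "cat"]:
    hashed_vector[hashed_index(word, num_buckets)] += 1
```

      Here `explained` recovers which words produced the non-zero entries, whereas `hashed_vector` carries the same total count but no way to name its buckets.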

      The CountVectorizer could be generalized to support arbitrary feature values.
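
      One possible reading of this generalization, sketched in Python: index any hashable feature value rather than only strings. The names `fit_index` and `count_vector` are hypothetical, and the sample data (bigram tuples and an integer) is invented for illustration.

```python
from collections import Counter

def fit_index(samples):
    """Assign an index to every distinct hashable value, by first appearance."""
    index = {}
    for sample in samples:
        for value in sample:
            index.setdefault(value, len(index))
    return index

def count_vector(sample, index):
    """Count occurrences of each indexed value; unseen values are ignored."""
    counts = Counter(sample)
    # Dicts preserve insertion order, so position i corresponds to index i.
    return [counts.get(value, 0) for value in index]

samples = [
    [("a", "b"), ("b", "c")],      # bigrams represented as tuples
    [("b", "c"), ("b", "c"), 42],  # mixed value types
]
index = fit_index(samples)
vector = count_vector(samples[1], index)
```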

      The CountVectorizer should be implemented as a Transformer.

      Resources:
      [1] http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage

        Activity

        till.rohrmann Till Rohrmann added a comment -

        I actually implemented a simple CountVectorizer for one of my presentations [1]. I thought about making a PR out of it.

        [1] https://github.com/tillrohrmann/flink/blob/zeppelin/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/CountVectorizer.scala

        sachingoel0101 Sachin Goel added a comment -

        Hi Alexander, are there any updates on this?


          People

          • Assignee:
            Roshani19 ROSHANI NAGMOTE
            Reporter:
            till.rohrmann Till Rohrmann
          • Votes:
            0
            Watchers:
            4
