Description
Currently, CountVectorizer has a minDF parameter.
It would be useful to also have a maxDF parameter.
It would serve as an upper threshold on document frequency, filtering out terms that occur very frequently across a text corpus, since such terms carry little information and are often stop words.
This is analogous to the max_df parameter of scikit-learn's CountVectorizer.
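
The intended filtering can be illustrated with a minimal sketch in plain Python. This is not Spark's implementation: the function name `filter_vocabulary` and the toy corpus are hypothetical, and both thresholds are assumed here to be absolute document counts.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=None):
    """Return the sorted terms whose document frequency lies in [min_df, max_df].

    Hypothetical helper for illustration only; thresholds are assumed to be
    absolute document counts.
    """
    # Document frequency: the number of documents containing the term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    # With no max_df given, no upper bound applies.
    upper = len(docs) if max_df is None else max_df
    return sorted(term for term, n in df.items() if min_df <= n <= upper)


# Toy corpus of tokenized documents (made up for this example).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

# "the" appears in all 3 documents, so max_df=2 removes it from the vocabulary.
vocab = filter_vocabulary(docs, min_df=1, max_df=2)
print(vocab)
```

Setting `min_df=2` as well would leave only the terms found in exactly two documents, which shows how the two thresholds bound the vocabulary from both sides.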
Issue Links
- relates to SPARK-23615 Add maxDF Parameter to Python CountVectorizer (Resolved)