  Spark / SPARK-23166

Add maxDF Parameter to CountVectorizer

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.4.0
    • Component/s: ML
    • Labels: None

    Description

      Currently, CountVectorizer has a minDF parameter.

      It would be useful to also have a maxDF parameter.
      It would serve as a threshold for filtering out terms that occur very frequently in a text corpus, since such terms carry little information and may even be stop words.

      This is analogous to the max_df parameter of scikit-learn's CountVectorizer.
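
      To make the proposal concrete, here is a minimal pure-Python sketch of document-frequency filtering with both a lower and an upper bound. It is not Spark's implementation: the function name build_vocabulary, its signature, and the rule that values below 1 are read as a fraction of the corpus while values of 1 or more are read as absolute document counts are assumptions made for illustration, modeled on how scikit-learn treats max_df.

```python
from collections import Counter


def build_vocabulary(docs, min_df=1.0, max_df=None):
    """Hypothetical helper: keep terms whose document frequency is in [min_df, max_df]."""
    n_docs = len(docs)

    def to_count(threshold):
        # Assumption: thresholds >= 1 are absolute document counts,
        # thresholds below 1 are fractions of the corpus size.
        return threshold if threshold >= 1 else threshold * n_docs

    # Document frequency: number of documents that contain the term at least once.
    df = Counter(term for doc in docs for term in set(doc))

    lower = to_count(min_df)
    upper = to_count(max_df) if max_df is not None else float("inf")
    return sorted(term for term, count in df.items() if lower <= count <= upper)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

# "the" appears in all 4 documents, which exceeds 50% of the corpus, so it is dropped.
vocab = build_vocabulary(docs, min_df=1, max_df=0.5)
print(vocab)
```

      With no upper bound (max_df=None) the term "the" stays in the vocabulary, which is the situation the issue describes for a vectorizer that only supports minDF.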

            People

              Assignee: Yacine Mazari (yacine.mazari)
              Reporter: Yacine Mazari (yacine.mazari)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: