Spark / SPARK-32662

CountVectorizerModel: Remove requirement for minimum vocabulary size

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.1.0
    • Component/s: ML, MLlib
    • Labels:
      None

      Description

      Currently `CountVectorizer.scala` has the following requirement:

      require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")

      This constraint is not necessary. The model should be able to function even when the vocabulary is empty.

      Removing it would also make it possible to run the model over empty datasets. HashingTF works fine in such scenarios; CountVectorizer does not.
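
To illustrate why an empty vocabulary is still a well-defined case, here is a minimal sketch in plain Scala. It is not Spark's implementation: `EmptyVocabSketch` and `countVector` are hypothetical names used only for illustration, and the sketch assumes the output vector has one slot per vocabulary term. Under that assumption, an empty vocabulary simply produces a zero-length vector rather than an error.

```scala
// Plain-Scala sketch of vocabulary-based term counting; not Spark's actual code.
object EmptyVocabSketch {
  // Hypothetical helper: counts how often each vocabulary term occurs in `tokens`.
  // The result has one slot per vocabulary term, so an empty vocabulary
  // yields a zero-length vector and out-of-vocabulary tokens are ignored.
  def countVector(vocab: Array[String], tokens: Seq[String]): Array[Double] = {
    val index = vocab.zipWithIndex.toMap
    val counts = Array.fill(vocab.length)(0.0)
    tokens.foreach { t =>
      index.get(t).foreach(i => counts(i) += 1.0)
    }
    counts
  }

  def main(args: Array[String]): Unit = {
    val doc = Seq("a", "b", "a")
    println(countVector(Array("a", "b"), doc).mkString("[", ", ", "]"))        // prints "[2.0, 1.0]"
    println(countVector(Array.empty[String], doc).mkString("[", ", ", "]"))    // prints "[]"
  }
}
```

Nothing in the counting logic itself depends on the vocabulary being non-empty, which is the point the description makes about the `require` check.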

      spark-user discussion reference: http://apache-spark-user-list.1001560.n3.nabble.com/Ability-to-have-CountVectorizerModel-vocab-as-empty-td38396.html

            People

            • Assignee:
              Jatin Puri (purijatin)
            • Reporter:
              Jatin Puri (purijatin)
            • Votes:
              0
            • Watchers:
              4
