Spark / SPARK-10105

Adding a parameter for the k most frequent words to the Word2Vec implementation

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib

      Description

      When training Word2Vec on a very large dataset, it is hard to choose the right value for the minCount parameter. It would help to have a parameter that sets how many words the vocabulary should contain, so that only the k most frequent words are kept.
      Furthermore, the original Word2Vec paper states that it took into account the 1M most frequent words.
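
To illustrate the request, here is a minimal sketch of how a "k most frequent words" setting could be translated into an equivalent minCount threshold. It is plain Python and is not part of Spark; the function name `min_count_for_top_k` and its arguments are hypothetical. It assumes the vocabulary keeps every word whose count is at least minCount, so words tied at the cutoff are all dropped and the resulting vocabulary can be smaller than k.

```python
from collections import Counter


def min_count_for_top_k(sentences, k):
    """Return the smallest minCount that keeps at most k distinct words.

    `sentences` is an iterable of token lists. Hypothetical helper,
    assuming a word is kept when its count is >= minCount.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    # Count every token across all sentences.
    counts = Counter(word for sentence in sentences for word in sentence)

    # Word frequencies, most frequent first.
    freqs = sorted(counts.values(), reverse=True)

    # Fewer than k distinct words: keep everything.
    if len(freqs) <= k:
        return 1

    # freqs[k] is the count of the (k+1)-th most frequent word.
    # The threshold must exceed it so that word (and its ties) is excluded.
    return freqs[k] + 1


if __name__ == "__main__":
    corpus = [["a", "b", "a", "c"], ["a", "b", "d"]]
    # Counts: a=3, b=2, c=1, d=1
    print(min_count_for_top_k(corpus, 2))
```

The returned value could then be passed to the existing minCount setting (for example `setMinCount` on MLlib's `Word2Vec`). On a large dataset the counting step would have to be done as a distributed job rather than with an in-memory `Counter`, which is part of why a built-in parameter would be more convenient.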


            People

            • Assignee: Unassigned
            • Reporter: tmnd91 Antonio Murgia
            • Votes: 0
            • Watchers: 3
