Spark / SPARK-10105

Add a parameter for the k most frequent words to the Word2Vec implementation


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib

      Description

      When training Word2Vec on a very large dataset, it is hard to choose the right value for the minCount parameter. A parameter that sets how many words the vocabulary should contain would help.
      Furthermore, the original Word2Vec paper states that the authors took into account only the most frequent 1M words.
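
      The requested behaviour can be illustrated with a minimal sketch in plain Python, outside of Spark. It is not the MLlib implementation, and the function name top_k_vocabulary and its parameter k are hypothetical, introduced here only for illustration. The sketch counts word frequencies, keeps the k most frequent words, and reports the count of the least frequent word kept, which is roughly the minCount value a user would otherwise have to find by trial and error.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_vocabulary(sentences: Iterable[List[str]], k: int) -> Tuple[List[str], int]:
    """Hypothetical helper: return the k most frequent words and the
    count of the least frequent word that was kept."""
    if k <= 0:
        raise ValueError("k must be positive")

    # Count every token across all sentences.
    counts = Counter(word for sentence in sentences for word in sentence)

    # most_common(k) returns (word, count) pairs sorted by descending count.
    most_common = counts.most_common(k)
    if not most_common:
        return [], 0

    vocabulary = [word for word, _ in most_common]
    # A minCount threshold at this value keeps at least these k words.
    equivalent_min_count = most_common[-1][1]
    return vocabulary, equivalent_min_count


# Toy corpus, made up for this example.
corpus = [
    ["spark", "mllib", "word2vec"],
    ["spark", "word2vec", "vocabulary"],
    ["spark", "mllib"],
]
vocab, min_count = top_k_vocabulary(corpus, k=3)
```

      Note that a count threshold and a top-k cutoff are not interchangeable when several words share the boundary count: a minCount equal to the reported value may keep more than k words, which is one reason a dedicated vocabulary-size parameter is easier to reason about.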


            People

            • Assignee: Unassigned
            • Reporter: Antonio Murgia (tmnd91)

