Spark / SPARK-11898

Use broadcast for the global tables in Word2Vec

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.2
    • Fix Version/s: 2.0.0
    • Component/s: MLlib
    • Labels: None

    Description

      syn0Global and syn1Global in Word2Vec are quite large objects with size (vocab * vectorSize * 8), yet they are shipped to the workers through ordinary task serialization.
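
To give a sense of scale, the size formula above can be evaluated for the vocabulary and vector size used in the benchmark quoted in this description. This is only a rough sketch: the helper name `table_bytes` is hypothetical, and the factor of 8 is taken directly from the formula in the description rather than from the MLlib source.

```python
def table_bytes(vocab_size, vector_size, bytes_per_value=8):
    """Size of one global table, per the formula vocab * vectorSize * 8."""
    return vocab_size * vector_size * bytes_per_value


# Figures from the benchmark in the description: 1M vocabulary, vectorSize 100.
size = table_bytes(1_000_000, 100)
print(f"{size / 1e6:.0f} MB per table")  # prints "800 MB per table"
```

Under this formula each table is on the order of hundreds of megabytes, which is the payload that would otherwise travel with serialized tasks.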

      Using broadcast variables instead can greatly improve performance. In my benchmark, with a 1M-word vocabulary and the default vectorSize of 100, switching to broadcast:
      1. decreased worker memory consumption by 45%;
      2. decreased running time by 40%.
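
The change proposed above, replacing closure capture with a broadcast variable, can be sketched as follows. This is not the actual patch: MLlib's Word2Vec is written in Scala, and this is a small PySpark analogue of the pattern. The names `partition_checksum`, `run`, and `syn0_global` are hypothetical, the table is a toy list of lists, and `sc` is assumed to be an existing `SparkContext`.

```python
def partition_checksum(word_ids, table):
    """Per-partition work: look up each word id in the shared table."""
    return sum(sum(table[i]) for i in word_ids)


def run(sc, vocab_size=1000, vector_size=100):
    # Hypothetical stand-in for syn0Global: one vector per vocabulary word.
    syn0_global = [[0.01] * vector_size for _ in range(vocab_size)]
    word_ids = sc.parallelize(range(vocab_size), numSlices=8)

    # Closure capture: the table is serialized along with every task.
    by_closure = word_ids.mapPartitions(
        lambda it: [partition_checksum(it, syn0_global)]).sum()

    # Broadcast: the table is cached on each executor and read via .value,
    # instead of being shipped with each task.
    bc_syn0 = sc.broadcast(syn0_global)
    by_broadcast = word_ids.mapPartitions(
        lambda it: [partition_checksum(it, bc_syn0.value)]).sum()

    bc_syn0.unpersist()  # release the cached copies on the executors
    return by_closure, by_broadcast
```

Both variants compute the same result; the difference is only in how the table reaches the workers. With a live context, for example one created by `SparkContext("local[2]", "broadcast-sketch")`, calling `run(sc)` returns the two sums for comparison.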

            People

              Assignee: Yuhao Yang (yuhaoyan)
              Reporter: Yuhao Yang (yuhaoyan)