  1. Spark
  2. SPARK-11813

Avoid serialization of vocab in Word2Vec

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.1.1, 1.2.2, 1.3.1, 1.4.2, 1.5.1, 1.6.0
    • Fix Version/s: 1.1.2, 1.2.3, 1.3.2, 1.4.2, 1.5.2, 1.6.0
    • Component/s: MLlib
    • Labels: None

      Description

      Avoiding serialization of vocab in Word2Vec has two benefits.

      1. Performance improves because less data is serialized.

      2. It can increase the capacity of Word2Vec.
      Currently, in Word2Vec's fit, the serialized closure consists mainly of the Word2Vec instance and 2 global tables.
      The bulk of the Word2Vec instance is the vocab, whose size is about vocab * 40 * 2 * 4 = 320 * vocab bytes;
      the 2 global tables take vocab * vectorSize * 8 bytes.

      Their sum cannot exceed Int.MaxValue because ByteArrayOutputStream is backed by a single byte array. In any case, avoiding serialization of vocab shrinks the serialized closure, especially when vectorSize is small, and so allows a larger vocabulary.
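
      The size estimate above can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: it uses just the two terms from the description (320 bytes per word for vocab, vectorSize * 8 bytes per word for the 2 global tables) and ignores every other contributor to the closure. The function names are hypothetical and are not part of Spark.

```python
# JVM Int.MaxValue: the largest byte array a ByteArrayOutputStream can back.
INT_MAX = 2**31 - 1

# Per-word estimates taken from the issue description (not measured).
VOCAB_BYTES_PER_WORD = 40 * 2 * 4   # 320 bytes of vocab per word
TABLE_BYTES_PER_ENTRY = 8           # 2 global tables, per vector component


def closure_bytes(vocab_size, vector_size, serialize_vocab=True):
    """Estimated closure size in bytes, using only the two terms above."""
    tables = vocab_size * vector_size * TABLE_BYTES_PER_ENTRY
    vocab = vocab_size * VOCAB_BYTES_PER_WORD if serialize_vocab else 0
    return tables + vocab


def max_vocab_size(vector_size, serialize_vocab=True):
    """Largest vocabulary whose estimated closure still fits in INT_MAX bytes."""
    per_word = vector_size * TABLE_BYTES_PER_ENTRY
    if serialize_vocab:
        per_word += VOCAB_BYTES_PER_WORD
    return INT_MAX // per_word


if __name__ == "__main__":
    for vector_size in (10, 100, 300):
        with_vocab = max_vocab_size(vector_size, serialize_vocab=True)
        without_vocab = max_vocab_size(vector_size, serialize_vocab=False)
        print(vector_size, with_vocab, without_vocab)
```

      Under these assumptions, with vectorSize = 10 the per-word cost drops from 400 bytes to 80 bytes once vocab is left out, so the vocabulary ceiling rises by roughly a factor of five. The gain shrinks as vectorSize grows, because the global tables then dominate the closure.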

            People

            • Assignee: yuhao yang (yuhaoyan)
            • Reporter: yuhao yang (yuhaoyan)
            • Votes: 0
            • Watchers: 4
