In the new version of Word2Vec, model saving was modified to estimate an appropriate number of partitions based on the Kryo buffer size. This is a welcome improvement, but there is a caveat for very large models.
Each (word, vector) tuple goes through a transformation into a local case class, Data(word, vector) — presumably for the Kryo serialization process. The new version of the code iterates over the entire vocabulary to perform this transformation in the driver's heap (the old version wrapped the entire datum), only to have the result distributed to the cluster afterwards to be written into its Parquet files.
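For reference, the driver-side pattern described above looks roughly like the sketch below. This is not the verbatim MLlib source; names such as `wordIndex`, `wordVectors`, `vectorSize`, `numPartitions`, and `dataPath` are assumptions based on the model layout, and a SQLContext with implicits in scope is assumed for `toDF()`:

```scala
// Sketch (not verbatim source) of the current save path, assuming a
// SparkContext `sc` and a model exposing a word-to-index map plus a
// flat array of concatenated vectors.
case class Data(word: String, vector: Array[Float])

// Every vocabulary entry is materialized as a Data object on the driver,
// iterating over the whole vocabulary in the driver's heap...
val dataArray: Seq[Data] = model.wordIndex.toSeq.map { case (word, i) =>
  Data(word, model.wordVectors.slice(i * vectorSize, (i + 1) * vectorSize))
}

// ...and only then is the result shipped to the cluster to be written
// out as Parquet.
sc.parallelize(dataArray, numPartitions).toDF().write.parquet(dataPath)
```

With millions of vocabulary entries, the `map` above allocates millions of short-lived objects in a single driver-side pass before anything leaves the driver.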
With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, and tri-grams), that local driver-side transformation causes the driver to hang indefinitely in GC — presumably because it generates millions of short-lived objects that cannot be evicted fast enough.
Perhaps I'm overlooking something, but since the result is distributed over the cluster to be saved after the transformation anyway, we may as well distribute it first, let the cluster resources perform the transformation more efficiently, and write the Parquet file from there.
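A minimal sketch of that idea, under the same assumed names as above (`wordIndex`, `wordVectors`, `vectorSize`, `numPartitions`, `dataPath` are illustrative, not the actual patch):

```scala
// Sketch of the proposed change: parallelize the raw vocabulary entries
// first, then perform the Data transformation on the executors instead
// of the driver.
case class Data(word: String, vector: Array[Float])

val dataRDD = sc
  .parallelize(model.wordIndex.toSeq, numPartitions)
  .map { case (word, i) =>
    // The per-entry object churn now happens on executor heaps, spread
    // across the cluster, rather than all at once on the driver.
    Data(word, model.wordVectors.slice(i * vectorSize, (i + 1) * vectorSize))
  }

dataRDD.toDF().write.parquet(dataPath)
```

One trade-off to note: referencing `model.wordVectors` inside the closure ships the full vector array to the executors (or it could be broadcast explicitly), so this trades driver GC pressure for some network transfer.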
I have a patch implemented and am in the process of testing it at scale. I will open a pull request once I'm confident the patch resolves the issue and passes the unit tests.