[SPARK-17595] Inefficient selection in Word2VecModel.findSynonyms - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.1.0
Component/s: MLlib
Labels: None

Description

The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with the highest similarity to the query vector currently sorts the similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary.
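
To illustrate the proposed approach, here is a minimal, self-contained sketch of single-pass top-k selection with a bounded priority queue. It is not the code from the pull request: the object name `TopKSketch`, the method `topK`, and the `(String, Double)` pair representation are hypothetical and chosen only for illustration. It uses `scala.collection.mutable.PriorityQueue` from the standard library. For a vocabulary of n words and k requested synonyms, this does O(n log k) work with O(k) extra memory, compared with O(n log n) for a full sort.

```scala
import scala.collection.mutable

object TopKSketch {
  // Return the k (word, similarity) pairs with the highest similarity,
  // in descending order, using one pass over the input.
  def topK(scored: Iterator[(String, Double)], k: Int): Seq[(String, Double)] = {
    require(k >= 0, "k must be non-negative")
    // The reversed ordering turns the queue into a min-heap on similarity,
    // so `head` is always the weakest candidate currently retained.
    val heap = mutable.PriorityQueue.empty[(String, Double)](
      Ordering.by[(String, Double), Double](_._2).reverse)
    scored.foreach { candidate =>
      if (heap.size < k) {
        // Still below capacity: keep every candidate.
        heap.enqueue(candidate)
      } else if (k > 0 && candidate._2 > heap.head._2) {
        // At capacity: evict the weakest entry in favour of a better one.
        heap.dequeue()
        heap.enqueue(candidate)
      }
    }
    // Only k elements remain, so this final sort is cheap.
    heap.toList.sortBy(-_._2)
  }
}
```

For example, `TopKSketch.topK(Iterator("a" -> 0.1, "b" -> 0.9, "c" -> 0.5, "d" -> 0.7), 2)` yields `List((b,0.9), (d,0.7))`.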

Issue Links

links to

[Github] Pull Request #15150 (willb)

People

Assignee: William Benton

Reporter: William Benton

Votes: 0

Watchers: 2

Dates

Created: 19/Sep/16 14:35

Updated: 21/Sep/16 08:45

Resolved: 21/Sep/16 08:45