[SPOT-161] [ML] Improve the time spot-ml takes to join LDA results and original dataset - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: 1.0
Labels:
- spot-ml
- spot-release

Epic Link:
Apache Spot (Incubating) Release 1.0

Description

Right now spot-ml executes Spark SQL joins to attach the topic distribution produced by LDA to the list of records, and that takes a long time for big datasets.

One idea is to pass a new autoBroadcastJoinThreshold to spot-ml's Spark job so that Spark can auto-broadcast a table bigger than the default threshold (10 MB). Along with that change, we need to reduce the amount of data being broadcast by changing the precision of the probabilities from 64-bit (Double) to 32-bit (Float).
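
As a rough illustration of why the precision change matters, the sketch below estimates the raw size of the topic-probability values at each precision and compares it with the threshold. The function names, the row and topic counts, and the idea of counting only the probability values are assumptions made for the example; Spark decides from its own size estimate of the whole table, which also includes keys and overhead.

```python
import struct

# Spark's default for spark.sql.autoBroadcastJoinThreshold is 10 MB.
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024

# Bytes used by one probability value at each precision.
BYTES_PER_VALUE = {
    "double": struct.calcsize("<d"),  # 64-bit
    "float": struct.calcsize("<f"),   # 32-bit
}


def raw_payload_bytes(num_rows, num_topics, precision):
    """Bytes needed for the probability values only (hypothetical helper).

    Keys, object headers and serialization overhead are ignored.
    """
    return num_rows * num_topics * BYTES_PER_VALUE[precision]


def fits_in_broadcast(num_rows, num_topics, precision,
                      threshold=DEFAULT_BROADCAST_THRESHOLD):
    """True when the raw probability payload is at or below the threshold."""
    return raw_payload_bytes(num_rows, num_topics, precision) <= threshold


def to_float32(value):
    """Round-trip a Python float through 32 bits to show the precision loss."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


if __name__ == "__main__":
    # Illustrative numbers only: 100,000 rows with 20 topics each.
    for precision in ("double", "float"):
        size = raw_payload_bytes(100_000, 20, precision)
        print(precision, size, fits_in_broadcast(100_000, 20, precision))
    print(to_float32(0.1))
```

With these illustrative numbers the Double payload is above the default threshold and the Float payload is below it, at the cost of a small rounding error in each probability.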

Both autoBroadcastJoinThreshold and precision should be configurable, so users can decide whether to change the precision and/or broadcast something bigger than the default threshold.
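
One way the two user settings could be carried is sketched below. This is not spot-ml's actual configuration code: the JoinTuning class, its field names and the --precision application argument are hypothetical. Only the spark.sql.autoBroadcastJoinThreshold property (a size in bytes, where -1 disables broadcast joins) is a real Spark setting.

```python
from dataclasses import dataclass

DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark default, in bytes


@dataclass(frozen=True)
class JoinTuning:
    """Hypothetical holder for the two user-configurable settings."""
    auto_broadcast_threshold_bytes: int = DEFAULT_BROADCAST_THRESHOLD
    precision: str = "double"  # "double" (64-bit) or "float" (32-bit)

    def __post_init__(self):
        if self.precision not in ("double", "float"):
            raise ValueError("precision must be 'double' or 'float'")
        # Spark accepts -1 to disable broadcast joins entirely.
        if self.auto_broadcast_threshold_bytes < -1:
            raise ValueError("threshold must be -1 or a non-negative size")


def spark_submit_conf(tuning):
    """Build the spark-submit arguments that set the broadcast threshold."""
    return [
        "--conf",
        f"spark.sql.autoBroadcastJoinThreshold={tuning.auto_broadcast_threshold_bytes}",
    ]


def application_args(tuning):
    """Build a hypothetical application argument carrying the precision."""
    return ["--precision", tuning.precision]
```

Leaving both fields at their defaults keeps the current behaviour, so a user only opts into a larger broadcast or reduced precision by setting them explicitly.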

People

Assignee: Ricardo Barona

Reporter: Ricardo Barona

Votes: 0

Watchers: 1

Dates

Created: 29/May/17 20:28

Updated: 12/Jul/17 22:50

Resolved: 15/Jun/17 22:02