Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Implemented
None
Description
Right now spot-ml executes Spark SQL joins to attach the topic distribution from LDA to the list of records, and those joins take a long time on big datasets.
One idea is to pass a new autoBroadcastJoinThreshold to spot-ml's Spark job so that Spark can auto-broadcast a table bigger than the default threshold (10 MB). Along with that change, we need to reduce the amount of data being broadcast by changing the precision of the probabilities from 64-bit (Double) to 32-bit (Float).
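To illustrate why halving the precision matters for broadcasting, here is a rough back-of-the-envelope sketch in Scala. The row and topic counts are hypothetical, and the estimate covers only the raw probability values; Spark's own size estimate also includes keys and per-row overhead, so real numbers will differ.

```scala
object BroadcastSizeEstimate {
  // Spark's documented default for spark.sql.autoBroadcastJoinThreshold (10 MB).
  val DefaultThresholdBytes: Long = 10L * 1024 * 1024

  /** Raw size of `rows` topic-distribution vectors with `topics` values each. */
  def payloadBytes(rows: Long, topics: Int, bytesPerValue: Int): Long =
    rows * topics * bytesPerValue

  def main(args: Array[String]): Unit = {
    // Hypothetical sizes, chosen only for illustration.
    val rows = 100000L
    val topics = 20

    val asDouble = payloadBytes(rows, topics, 8) // 64-bit Double
    val asFloat  = payloadBytes(rows, topics, 4) // 32-bit Float

    println(s"Double: $asDouble bytes, above default threshold: ${asDouble > DefaultThresholdBytes}")
    println(s"Float:  $asFloat bytes, above default threshold: ${asFloat > DefaultThresholdBytes}")
  }
}
```

With these made-up numbers the Double payload is above the default threshold while the Float payload is below it, which is the effect the precision change is after.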
Both autoBroadcastJoinThreshold and precision should be user-configurable, so users can decide whether to change the precision and/or broadcast something bigger than the default threshold.
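A minimal sketch of how the two settings could be wired into the join, assuming Spark 2.x or later. `JoinSettings`, `TopicJoinSketch`, and the column names `doc_id` and `topic_mix` are hypothetical and are not spot-ml's actual classes or schema; only the option key `spark.sql.autoBroadcastJoinThreshold` and the Spark calls themselves come from Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, FloatType}

// Hypothetical holder for the two user-facing options.
final case class JoinSettings(
  autoBroadcastJoinThresholdBytes: Long,
  useFloatPrecision: Boolean
)

object TopicJoinSketch {
  def joinTopics(
      spark: SparkSession,
      records: DataFrame,
      topics: DataFrame,
      settings: JoinSettings): DataFrame = {

    // Raise (or lower) the size limit under which Spark broadcasts a join side.
    spark.conf.set(
      "spark.sql.autoBroadcastJoinThreshold",
      settings.autoBroadcastJoinThresholdBytes.toString)

    // Optionally narrow the probabilities from Double to Float to shrink the table.
    // Assumes `topic_mix` is an array<double> column (hypothetical name).
    val slimTopics =
      if (settings.useFloatPrecision)
        topics.withColumn("topic_mix", col("topic_mix").cast(ArrayType(FloatType)))
      else
        topics

    // Plain equi-join on the (hypothetical) key column; Spark decides whether
    // to broadcast `slimTopics` based on its size estimate and the threshold.
    records.join(slimTopics, Seq("doc_id"))
  }
}
```

The same threshold can also be supplied when the job is submitted, for example with `--conf spark.sql.autoBroadcastJoinThreshold=<bytes>` on `spark-submit`, which keeps the value out of the code entirely.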