  1. Spot (Retired)
  2. SPOT-161

[ML] Improve the time spot-ml takes to join LDA results and original dataset

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented
    • None
    • 1.0

    Description

      Currently spot-ml uses Spark SQL joins to attach the LDA topic distribution to each record in the original dataset, and these joins take a long time on big datasets.

      One idea is to pass a larger autoBroadcastJoinThreshold to spot-ml's Spark job so that Spark can automatically broadcast tables bigger than the default threshold (10 MB). To go with that change, we should reduce the amount of data being broadcast by lowering the precision of the probabilities from 64-bit (Double) to 32-bit (Float).
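      The trade-off can be checked with some rough arithmetic. The sketch below is illustrative only: it counts just the raw bytes of the probabilities (ignoring join keys and Spark's own serialization overhead), and the row count, topic count, and raised threshold are made-up numbers, not measurements from spot-ml.

```python
import struct
from array import array

# Spark's documented default for spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def payload_bytes(rows, topics, typecode):
    """Raw size of the topic probabilities only.

    typecode 'd' is a 64-bit double, 'f' is a 32-bit float.
    Keys and Spark overhead are deliberately ignored (assumption).
    """
    return rows * topics * array(typecode).itemsize


def fits(rows, topics, typecode, threshold_bytes):
    """True if the raw payload is at or below the broadcast threshold."""
    return payload_bytes(rows, topics, typecode) <= threshold_bytes


def to_float32(p):
    """Round-trip a probability through 32-bit storage to see the precision loss."""
    return struct.unpack("f", struct.pack("f", p))[0]


if __name__ == "__main__":
    rows, topics = 200_000, 20          # hypothetical dataset shape
    raised = 20 * 1024 * 1024           # hypothetical raised threshold (20 MB)
    for code, name in (("d", "Double"), ("f", "Float")):
        size = payload_bytes(rows, topics, code)
        print(name, size, "bytes, fits raised threshold:", fits(rows, topics, code, raised))
    print("error for p=0.1 stored as Float:", abs(to_float32(0.1) - 0.1))
```

      With these hypothetical numbers neither representation fits under the default threshold, and only the Float version fits under the raised one, which is why the two changes are proposed together.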

      Both autoBroadcastJoinThreshold and the precision should be user-configurable, so users can decide whether to lower the precision and/or broadcast tables bigger than the default threshold.
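      One way the two settings could be exposed is sketched below. This is not spot-ml's actual interface: the function name, the `--precision` application flag, and the defaults are assumptions for illustration. Only the `spark.sql.autoBroadcastJoinThreshold` property itself is a real Spark setting.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # Spark's default (10 MB)


def build_job_args(threshold_bytes=DEFAULT_THRESHOLD_BYTES, precision=64):
    """Return (spark_submit_options, application_args) for the two user settings.

    threshold_bytes: value for spark.sql.autoBroadcastJoinThreshold.
    precision: 64 keeps Double probabilities, 32 switches to Float.
    The --precision flag is hypothetical, not a documented spot-ml option.
    """
    if precision not in (32, 64):
        raise ValueError("precision must be 32 or 64")
    spark_opts = [
        "--conf",
        f"spark.sql.autoBroadcastJoinThreshold={threshold_bytes}",
    ]
    app_args = ["--precision", str(precision)]  # hypothetical application flag
    return spark_opts, app_args


if __name__ == "__main__":
    # Defaults leave behavior unchanged: 10 MB threshold, 64-bit probabilities.
    print(build_job_args())
    # A user opting in to both changes.
    print(build_job_args(threshold_bytes=20 * 1024 * 1024, precision=32))
```

      Keeping the defaults at the current behavior (10 MB, 64-bit) means users who change nothing see no difference in results.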

          People

            Assignee: rabarona Ricardo Barona
            Reporter: rabarona Ricardo Barona
