Flink / FLINK-1742

Sample data points for MultipleLinearRegression to support proper SGD

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate

    Description

      Currently, the MultipleLinearRegression implementation applies stochastic gradient descent (SGD) to all data points in every iteration. To scale to very large data sets, each iteration should perform the SGD step on only a random subset of the data points. Proper data point sampling should therefore be added to the MultipleLinearRegression implementation.
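
      The idea of running each SGD iteration on a random subset can be sketched in plain Python. This is only an illustration under stated assumptions, not the Flink implementation: the function name `sgd_linear_regression` and the parameters `fraction`, `step_size`, and `iterations` are hypothetical, and the loss is assumed to be the squared error.

```python
import random


def sgd_linear_regression(points, fraction, step_size, iterations, seed=0):
    """Fit y ~ w.x + b with mini-batch SGD on the squared-error loss.

    points: list of (features, label) pairs, features being a list of floats.
    fraction: share of the data set sampled in each iteration (hypothetical knob).
    """
    rng = random.Random(seed)
    dim = len(points[0][0])
    weights = [0.0] * dim
    intercept = 0.0
    batch_size = max(1, int(fraction * len(points)))

    for _ in range(iterations):
        # Only a random subset of the data contributes to this iteration.
        batch = rng.sample(points, batch_size)
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in batch:
            error = sum(w * xi for w, xi in zip(weights, x)) + intercept - y
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        # Average the gradient over the batch and take one descent step.
        weights = [w - step_size * g / batch_size for w, g in zip(weights, grad_w)]
        intercept -= step_size * grad_b / batch_size

    return weights, intercept
```

      With `fraction=1.0` the sketch degenerates to the current behaviour described above, where every data point is used in every iteration.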

      A simple implementation would be a filter that flips a coin for each data point to decide whether to keep or discard it. The downside of this approach is that the whole data set still has to be processed in every iteration. It would be beneficial if a sampling operator that knows the data set's size did not have to process the whole data set. The size should be known for cached data sets in an iteration.
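
      The two sampling strategies can be contrasted with a small Python sketch. Both function names are hypothetical and neither is a Flink operator. The second variant additionally assumes that the cached data set allows access by index, which the issue does not state.

```python
import random


def bernoulli_sample(points, fraction, rng):
    """Coin-flip filter: inspects every point; the sample size is random."""
    return [p for p in points if rng.random() < fraction]


def index_sample(points, sample_size, rng):
    """Draw sample_size distinct indices up front, then read only those points.

    Requires the data set's size to be known and (assumption) random access.
    """
    indices = rng.sample(range(len(points)), sample_size)
    return [points[i] for i in indices]
```

      The coin-flip filter does work proportional to the size of the data set in every iteration, while the index-based variant does work proportional to the sample size once the data set's size is known.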

            People

              Assignee: Unassigned
              Reporter: Till Rohrmann (trohrmann)
              Votes: 0
              Watchers: 2
