Apache MADlib / MADLIB-1378

Preprocessor should evenly distribute data on an arbitrary number of segments


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.17
    • Component/s: Deep Learning
    • Labels: None

    Description

      We need to implement a feature for the preprocessor to generate distribution keys that ensure GPDB distributes the data in a controlled way.

      We want to assign distribution keys such that all rows on a given segment share a single key, and that key is unique to that segment.
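
      How such keys might be generated is an open implementation detail. One possible approach (a sketch only, with hypothetical table and column names, not necessarily what MADlib will ship) is to probe GPDB's hash distribution directly: distribute candidate integers and read back `gp_segment_id` to learn which key lands on which segment.

      ```sql
      -- Sketch: find one integer per segment whose hash lands on that segment.
      -- Assumes 1000 candidates are enough to hit every segment at least once.
      CREATE TEMP TABLE candidate_keys AS
      SELECT g AS k FROM generate_series(0, 999) g
      DISTRIBUTED BY (k);

      -- gp_segment_id reveals where each candidate row was placed; keep the
      -- smallest key per segment as that segment's distribution key.
      CREATE TEMP TABLE seg_keys AS
      SELECT DISTINCT ON (gp_segment_id) gp_segment_id AS seg_id, k AS dist_key
      FROM candidate_keys
      ORDER BY gp_segment_id, k
      DISTRIBUTED BY (seg_id);
      ```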

      Currently, `training_preprocessor_dl` and `validation_preprocessor_dl` do not guarantee even distribution of the data among segments, especially when the number of buffers is not much larger than the number of segments.

      We should fix the preprocessor so that it always distributes the data as evenly as possible among the segments.
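
      To picture the even-distribution target: mapping buffers to segments round-robin guarantees that per-segment buffer counts differ by at most one. The sketch below reuses the hypothetical `seg_keys` table from above and assumes a `buffers` table with a dense, 0-based `buffer_id`.

      ```sql
      -- Sketch: assign buffers round-robin to the per-segment keys, so the
      -- number of buffers per segment differs by at most one.
      CREATE TABLE buffers_even AS
      SELECT b.*,
             s.dist_key AS __dist_key__
      FROM buffers b
      JOIN seg_keys s
        ON s.seg_id = b.buffer_id % (SELECT count(*) FROM seg_keys)
      DISTRIBUTED BY (__dist_key__);
      ```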

      Another problem is that training with too many segments often slows accuracy convergence; the optimal number of segments will usually not match the total number of segments in the cluster. For this and other reasons, a user may wish to use only a subset of the available segments.

      We should add a `num_segments` option to both preprocessors and ensure that data is distributed evenly among those segments. The option should raise an error if the requested number of segments exceeds the total number of segments in the cluster, and should default to using all segments.
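
      Assuming the existing parameter order of `training_preprocessor_dl` (source table, output table, dependent and independent variable names, buffer size, normalizing constant, number of classes) with the proposed `num_segments` appended as an optional final argument, a call might look like this; the exact interface is hypothetical until the feature is designed:

      ```sql
      -- Hypothetical invocation once the proposed option exists:
      -- preprocess the data onto only 4 of the cluster's segments.
      SELECT madlib.training_preprocessor_dl(
          'source_table',    -- source table
          'output_table',    -- output table
          'y',               -- dependent variable column
          'x',               -- independent variable column
          256,               -- buffer_size
          255.0,             -- normalizing_const
          NULL,              -- num_classes
          4                  -- num_segments (proposed)
      );
      ```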


          People

            Assignee: Unassigned
            Reporter: Yuhao Zhang (yhzhang)
            Votes: 0
            Watchers: 2
