Apache MADlib / MADLIB-1168

Balance datasets

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.14
    • Component/s: Module: Sampling
    • Labels: None

    Description

      The motivation for balancing datasets, quoted from [1]:

      “Most classification algorithms will only perform optimally when the number of samples of each class is roughly the same. Highly skewed datasets, where the minority is heavily outnumbered by one or more classes, have proven to be a challenge while at the same time becoming more and more common.

      One way of addressing this issue is by re-sampling the dataset as to offset this imbalance with the hope of arriving at a more robust and fair decision boundary than you would otherwise.

      Re-sampling techniques can be divided in these categories:

      • Under-sampling the majority class(es).
      • Over-sampling the minority class.
      • Combining over- and under-sampling.
      • Create ensemble balanced sets.”
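
      The first two categories above can be illustrated with a minimal sketch in plain Python. This is not the MADlib interface or the method proposed in the attached document; the function names (`group_by_class`, `undersample`, `oversample`) and the `label_of` parameter are hypothetical, and purely random selection is assumed as the simplest re-sampling strategy.

```python
import random
from collections import defaultdict


def group_by_class(rows, label_of):
    """Bucket rows by class label (label_of is a hypothetical accessor)."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups


def undersample(rows, label_of, seed=0):
    """Randomly drop rows so every class shrinks to the smallest class size."""
    if not rows:
        return []
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = min(len(g) for g in groups.values())
    balanced = []
    for members in groups.values():
        # Sample without replacement: each kept row appears once.
        balanced.extend(rng.sample(members, target))
    return balanced


def oversample(rows, label_of, seed=0):
    """Randomly duplicate rows so every class grows to the largest class size."""
    if not rows:
        return []
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = max(len(g) for g in groups.values())
    balanced = []
    for members in groups.values():
        # Keep every original row, then add draws with replacement.
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Example: 90 rows of class 0 and 10 rows of class 1.
rows = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]
under = undersample(rows, label_of=lambda r: r[1])
over = oversample(rows, label_of=lambda r: r[1])
```

      Under-sampling discards majority rows and so loses information, while over-sampling repeats minority rows and so can encourage overfitting; combining the two is one way to trade these off.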

      There is an extensive literature on balancing datasets. The plan for MADlib in the initial phase is to offer basic functionality that can be extended in later phases based on feedback from users.

      Please see the attached document for the proposed scope of this story.

      References

      [1] imbalanced-learn Python project
      http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
      https://github.com/scikit-learn-contrib/imbalanced-learn

      Attachments

        1. MADlib_ Balanced Sampling.pdf
          193 kB
          Frank McQuillan
        2. MADlib_Balance_Datasets_Requirements_v2.pdf
          266 kB
          Frank McQuillan
        3. MADlib Balance Datasets Requirements.pdf
          264 kB
          Frank McQuillan

          People

            Assignee: Frank McQuillan (fmcquillan)
            Reporter: Frank McQuillan (fmcquillan)
            Votes: 0
            Watchers: 5
