Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17455

IsotonicRegression takes non-polynomial time for some inputs

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0
    • 2.2.0
    • MLlib
    • None

    Description

      The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently in MLlib can take O(N!) time for certain inputs, when it should have worst-case complexity of O(N^2).

      To reproduce this, I pulled the private method poolAdjacentViolators out of mllib.regression.IsotonicRegression and into a benchmarking harness.

      Given this input

      val x = (1 to length).toArray.map(_.toDouble)
      val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi}
      val w = Array.fill(length)(1d)
      
      val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, x), w) => (y, x, w)}
      

      I vary the length of the input to get these timings:

      Input Length Time (us)
      100 1.35
      200 3.14
      400 116.10
      800 2134225.90

      (tests were performed using https://github.com/sirthias/scala-benchmarking-template)

      I can also confirm that I run into this issue on a real dataset I'm working on when trying to calibrate random forest probability output. Some partitions take > 12 hours to run. This isn't a skew issue, since the largest partitions finish in minutes. I can only assume that some partitions cause something approaching this worst-case complexity.

      I'm working on a patch that borrows the implementation that is used in scikit-learn and the R "iso" package, both of which handle this particular input in linear time and are quadratic in the worst case.

      Attachments

        Activity

          People

            nseggert Nic Eggert
            nseggert Nic Eggert
            Joseph K. Bradley Joseph K. Bradley
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: