  1. Apache MADlib
  2. MADLIB-805

CoxPH performance enhancements

Details

    • Type: Request
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      I'm not sure whether this is already tracked somewhere in JIRA, but I'm filing it just in case.

      I've done some preliminary testing on the performance of the CoxPH module, and I think there may be opportunities for improvement. Based on past discussions with the core MADlib team, it sounded like some improvements could be made to bring CoxPH closer to the ARIMA module in terms of performance (computations for ARIMA, like those for CoxPH, are dependent on the ordering of observations).

      Note that the dataset used for testing was artificially blown up to various numbers of records, which resulted in many ties in the dataset.
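
      To illustrate why blowing up a dataset produces ties, here is a minimal sketch. It assumes the records were multiplied by duplicating seed rows, which this ticket does not state explicitly. The names `blow_up` and `max_tie_size` and the `(time, status)` rows are hypothetical and are not taken from the test code linked below.

```python
from collections import Counter


def blow_up(seed_rows, factor):
    """Replicate every seed row `factor` times (assumed enlargement method)."""
    return [row for row in seed_rows for _ in range(factor)]


def max_tie_size(rows):
    """Return the largest number of records that share one event time."""
    counts = Counter(time for time, _status in rows)
    return max(counts.values())


# Hypothetical seed data: (time_to_event, status) pairs with distinct times.
seed = [(5.0, 1), (8.0, 0), (12.0, 1), (20.0, 1)]
blown_up = blow_up(seed, 1000)

print(len(blown_up), max_tie_size(seed), max_tie_size(blown_up))
```

      Under that assumption, every distinct event time in the seed data is shared by as many records as the replication factor, so tie handling could account for a noticeable share of the measured runtime.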

      The links below are currently accessible to Pivotal only.

      Results from preliminary testing (various numbers of observations, 7 explanatory variables):

      https://drive.google.com/a/gopivotal.com/file/d/0B9bfZ-YiuzxQYkdJNDdLZ3owY2s/edit?usp=sharing

      Code used for the above results (for the version with 8 million records):
      https://drive.google.com/a/gopivotal.com/file/d/0B9bfZ-YiuzxQbHdGNHF1ejcxVUk/edit?usp=sharing

      Sample "seed" dataset (note that the number of records in this dataset was blown up artificially during testing):
      https://drive.google.com/a/gopivotal.com/file/d/0B9bfZ-YiuzxQbGtJVWM3ejFXZnM/edit?usp=sharing

          People

            riyer Rahul Iyer
            jungw2 Woo Jung