  1. Apache MADlib
  2. MADLIB-805

CoxPH performance enhancements

    Details

    • Type: Request
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      This may already be tracked somewhere in JIRA, but I wanted to file it just in case.

      I've done some preliminary performance testing of the CoxPH module, and I think there may be opportunities for improvement. Based on past discussions with the core MADlib team, it sounded like some improvements could be made to bring CoxPH closer to the ARIMA module in terms of performance (computations for ARIMA, like those for CoxPH, depend on the ordering of observations).
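
To illustrate why the ordering of observations matters for CoxPH, here is a minimal Python sketch of the Cox partial log-likelihood with Breslow handling of ties. It is not MADlib's implementation: the function name `cox_partial_loglik` and the precomputed `scores` argument are assumptions made for illustration. Scanning the observations in descending time order turns each risk-set sum into a running sum, so the data must be ordered before the pass.

```python
import math

def cox_partial_loglik(times, events, scores):
    """Breslow partial log-likelihood for precomputed linear scores.

    Illustrative sketch only; not MADlib's implementation.
    """
    # Sort indices by descending time so the risk set grows as we scan.
    order = sorted(range(len(times)), key=lambda i: -times[i])
    risk_sum = 0.0  # sum of exp(score) over subjects with time >= current time
    loglik = 0.0
    i = 0
    while i < len(order):
        # Find the block of observations tied at the current time.
        j = i
        while j < len(order) and times[order[j]] == times[order[i]]:
            j += 1
        tied = order[i:j]
        # Add the whole tied block to the risk set before using it.
        risk_sum += sum(math.exp(scores[k]) for k in tied)
        for k in tied:
            if events[k]:
                loglik += scores[k] - math.log(risk_sum)
        i = j
    return loglik
```

Because the function sorts internally, the result does not depend on the input order, but the single pass itself is only valid on ordered data.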

      Note that the dataset used for testing was artificially blown up to various numbers of records, which resulted in many ties in the dataset.
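
A small Python sketch of how blowing up a dataset produces ties, assuming the records were simply replicated (the report does not say how the blow-up was done). The helper `blow_up` and the seed values are hypothetical.

```python
from collections import Counter

def blow_up(rows, factor):
    """Replicate every row `factor` times (hypothetical helper)."""
    return [row for row in rows for _ in range(factor)]

# Hypothetical seed data: (time, event) pairs with distinct times.
seed = [(5.0, 1), (8.0, 0), (12.0, 1)]
big = blow_up(seed, 4)

# Count how many observations share each time value.
tie_counts = Counter(t for t, _ in big)
```

Under this assumption, every distinct time in the seed ends up tied `factor` times, so the tie-handling path of the algorithm is exercised on every observation.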

      The links below are accessible only within Pivotal at the moment.

      Results from preliminary testing (various numbers of observations, 7 explanatory variables):

      https://drive.google.com/a/gopivotal.com/file/d/0B9bfZ-YiuzxQYkdJNDdLZ3owY2s/edit?usp=sharing

      Code used for the above results (for version w/ 8 million records):
      https://drive.google.com/a/gopivotal.com/file/d/0B9bfZ-YiuzxQbHdGNHF1ejcxVUk/edit?usp=sharing

      Sample "seed" dataset (note that the number of records in this dataset was blown up artificially during testing):
      https://drive.google.com/a/gopivotal.com/file/d/0B9bfZ-YiuzxQbGtJVWM3ejFXZnM/edit?usp=sharing

            People

            • Assignee:
              riyer Rahul Iyer
              Reporter:
              jungw2 Woo Jung
            • Votes:
              0
              Watchers:
              1
