Details
Type: Request
Status: Open
Priority: Major
Resolution: Unresolved
Description
I'm not sure whether this is already tracked somewhere in JIRA, but I wanted to file it just in case.
I've done some preliminary performance testing of the CoxPH module, and I think there may be opportunities for improvement. Based on past discussions w/ the core MADlib team, it sounded like some improvements could bring CoxPH closer to the ARIMA module in terms of performance (computations for ARIMA, like those for CoxPH, depend on the ordering of observations).
Note that the dataset used for testing was artificially blown up to various numbers of records, which resulted in many ties in the dataset.
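To make the ordering dependence and the effect of ties concrete, here is a minimal Python sketch. It is not MADlib's implementation: it assumes the Breslow approximation for tied event times, and the names `breslow_log_partial_likelihood` and `replicate` are hypothetical helpers made up for illustration. The point is that the risk-set sum is a running total over observations sorted by time, and that replicating a seed dataset multiplies the number of rows sharing each time value.

```python
import math
from itertools import groupby


def breslow_log_partial_likelihood(times, events, covariates, beta):
    """Cox log partial likelihood (Breslow ties), computed in one ordered pass."""
    # Sort by descending time so the risk set only ever grows as we scan.
    rows = sorted(zip(times, events, covariates), key=lambda r: -r[0])
    risk_sum = 0.0  # running sum of exp(x . beta) over subjects still at risk
    loglik = 0.0
    for _, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        scores = [sum(b * x for b, x in zip(beta, r[2])) for r in group]
        # All rows tied at this time enter the risk set together.
        risk_sum += sum(math.exp(s) for s in scores)
        event_scores = [s for s, r in zip(scores, group) if r[1]]
        loglik += sum(event_scores) - len(event_scores) * math.log(risk_sum)
    return loglik


def replicate(times, events, covariates, factor):
    """Blow up a seed dataset by repeating every row `factor` times."""
    return times * factor, events * factor, covariates * factor


seed_times = [1.0, 2.0, 3.0]
seed_events = [1, 1, 1]
seed_covariates = [[0.0], [0.0], [0.0]]

big_times, big_events, big_covariates = replicate(
    seed_times, seed_events, seed_covariates, 2
)
# Six rows but still only three distinct times: every time is now a tie.
print(len(big_times), len(set(big_times)))
print(breslow_log_partial_likelihood(big_times, big_events, big_covariates, [0.0]))
```

Because the pass relies on sorted input, a distributed implementation has to impose a global ordering before aggregating, which is presumably where the cost comes from in this kind of computation.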
The links below are accessible to Pivotal only at the moment.
Results from preliminary testing (various numbers of observations, 7 explanatory variables):
https://drive.google.com/a/gopivotal.com/file/d/0B9bfZ-YiuzxQYkdJNDdLZ3owY2s/edit?usp=sharing
Code used for the above results (for version w/ 8 million records):
https://drive.google.com/a/gopivotal.com/file/d/0B9bfZ-YiuzxQbHdGNHF1ejcxVUk/edit?usp=sharing
Sample "seed" dataset (note that the number of records in this dataset was blown up artificially during testing):
https://drive.google.com/a/gopivotal.com/file/d/0B9bfZ-YiuzxQbGtJVWM3ejFXZnM/edit?usp=sharing