Description
To debug performance issues when training MLlib algorithms, it is useful to log some metrics about the training dataset, the training parameters, etc.
This ticket is an umbrella for adding simple logging messages to the most common MLlib estimators. The logging should have no performance impact on the current implementation, and the output is simply written to the logs.
Here are some values that are of interest when debugging training tasks:
- number of features
- number of instances
- number of partitions
- number of classes
- input RDD/DF cache level
- hyper-parameters
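As an illustration only, the values listed above could be gathered into a single log line per training run. The sketch below is plain Python using the standard logging module, not the MLlib implementation; the names TrainingSummary, log_training_summary, and the logger name are hypothetical. It assumes the caller already has the values at hand (in PySpark, for instance, the partition count and cache level could be read from df.rdd.getNumPartitions() and df.storageLevel).

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("training_instrumentation")


@dataclass
class TrainingSummary:
    """Hypothetical container for the values of interest listed in the ticket."""
    estimator: str
    num_features: Optional[int] = None
    num_instances: Optional[int] = None
    num_partitions: Optional[int] = None
    num_classes: Optional[int] = None
    cache_level: Optional[str] = None
    hyper_params: Dict[str, Any] = field(default_factory=dict)

    def to_message(self) -> str:
        """Format the known values as one line; unknown (None) values are skipped."""
        parts = [
            ("numFeatures", self.num_features),
            ("numInstances", self.num_instances),
            ("numPartitions", self.num_partitions),
            ("numClasses", self.num_classes),
            ("cacheLevel", self.cache_level),
        ]
        fields = [f"{key}={value}" for key, value in parts if value is not None]
        # Sort hyper-parameters so the output is stable across runs.
        params = ", ".join(f"{k}={v}" for k, v in sorted(self.hyper_params.items()))
        fields.append(f"params={{{params}}}")
        return f"{self.estimator}: " + " ".join(fields)


def log_training_summary(summary: TrainingSummary) -> None:
    """Emit the summary at INFO level, skipping the formatting when INFO is disabled."""
    if logger.isEnabledFor(logging.INFO):
        logger.info(summary.to_message())


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_training_summary(
        TrainingSummary(
            estimator="LogisticRegression",
            num_features=10,
            num_instances=1000,
            num_partitions=4,
            num_classes=2,
            cache_level="MEMORY_AND_DISK",
            hyper_params={"regParam": 0.1, "maxIter": 100},
        )
    )
```

Checking the log level before building the message keeps the cost close to zero when INFO logging is turned off, in the spirit of the "no performance impact" requirement. Values that are expensive to compute, such as the number of instances, are left optional so a caller can omit them rather than trigger an extra pass over the data.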
Sub-Tasks
1. Log instrumentation in logistic regression as a first task | Resolved | Timothy Hunter
2. Log instrumentation in KMeans | Resolved | Xin Ren
3. Log instrumentation in Random forests | Resolved | Benjamin Fradet
4. Log instrumentation in ALS | Resolved | Miao Wang
5. Log instrumentation in GBTs | Resolved | Seth Hendrickson
6. Log instrumentation in GMM | Resolved | Ruifeng Zheng
7. Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit | Resolved | Sue Ann Hong
8. Log instrumentation in CrossValidator | Closed | Unassigned
9. Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg | Resolved | Ruifeng Zheng