Now that we have StringIndexer, spark.ml.classification.Classifier (the abstraction) could automatically index labels that are not already indexed.
This would require some design decisions:
- Should predict() output the original labels or the indices?
- How should we notify users that the labels are being automatically indexed?
- How should we expose the resulting label-to-index mapping to users?
- If multiple parts of a Pipeline automatically index labels, what do we need to do to make sure they are consistent?
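
To make the questions above concrete, here is a minimal plain-Python sketch of one possible answer. It is not Spark code and not a proposed API. `LabelIndex` and its methods are hypothetical names invented for illustration. The sketch assumes the mapping is fitted once, ordered by descending label frequency (StringIndexer's default ordering), with ties broken by string order as a choice made only for this sketch. The same fitted object is then reused wherever labels are indexed or predictions are mapped back.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List


class LabelIndex:
    """Hypothetical helper: a fitted, reusable label <-> index mapping."""

    def __init__(self, labels: List[Hashable]) -> None:
        # Position i holds the original label assigned to index i.
        self.labels = list(labels)
        self._to_index: Dict[Hashable, int] = {
            label: i for i, label in enumerate(self.labels)
        }

    @classmethod
    def fit(cls, raw_labels: Iterable[Hashable]) -> "LabelIndex":
        counts = Counter(raw_labels)
        # Most frequent label gets index 0. Ties are broken by string
        # order (an assumption of this sketch) so the result is deterministic.
        ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
        return cls(ordered)

    def to_indices(self, raw_labels: Iterable[Hashable]) -> List[float]:
        # Indices are emitted as floats, matching the usual label column type.
        return [float(self._to_index[label]) for label in raw_labels]

    def to_labels(self, indices: Iterable[float]) -> List[Hashable]:
        # Inverse mapping, so predictions can be reported as original labels.
        return [self.labels[int(i)] for i in indices]


train = ["spam", "ham", "ham", "eggs", "ham", "spam"]
index = LabelIndex.fit(train)          # fit once ...
encoded = index.to_indices(train)      # ... index labels for training ...
decoded = index.to_labels(encoded)     # ... and map predictions back
```

Keeping the ordered label list on the fitted object addresses two of the questions at once: users can inspect `index.labels` to see the mapping, and any other stage that needs indexed labels can be handed the same object instead of fitting its own, which avoids inconsistent indices within a Pipeline.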