HashingTF uses Scala's native ## hashing implementation. This causes two significant problems.
First, per the Scala documentation for hashCode, the implementation is platform specific. This means that feature vectors created on one platform may differ from vectors created on another. This can create significant problems when a model trained offline is used in another environment for online prediction. The problem is made harder by the fact that, after a hashing transform, features lose any human-tractable meaning, so a mismatch such as this may be extremely difficult to track down.
Second, the native Scala hashing function performs badly on longer strings, exhibiting 200-500% higher collision rates than, for example, MurmurHash3, which is also included in the standard Scala library and is the hashing choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If Spark users apply HashingTF only to very short, dictionary-like strings, the choice of hashing function will not be a big problem. But why have an implementation in MLlib with this limitation when a better one is readily available in the standard Scala library?
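For anyone who wants to check the collision behaviour on their own data, here is a minimal sketch of a comparison harness. The object name, the helper names, the synthetic terms and the feature dimension are all assumptions made for illustration. It prints counts only and does not reproduce the figures quoted above.

```scala
import scala.util.hashing.MurmurHash3

object CollisionSketch {
  // Map a possibly negative hash code into the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Count how many distinct terms share a bucket with an earlier term:
  // (number of distinct terms) - (number of distinct buckets used).
  def collisions(terms: Seq[String], numFeatures: Int)(hash: String => Int): Int = {
    val distinctTerms = terms.distinct
    val usedBuckets = distinctTerms.map(t => nonNegativeMod(hash(t), numFeatures)).toSet
    distinctTerms.size - usedBuckets.size
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical long, URL-like terms; substitute real feature strings here.
    val terms = (1 to 10000).map(i => s"http://example.com/some/long/path/segment/$i")
    val numFeatures = 1 << 18 // assumed feature dimension

    val native = collisions(terms, numFeatures)(_.##)
    val murmur = collisions(terms, numFeatures)(s => MurmurHash3.stringHash(s))

    println(s"native ## collisions:   $native")
    println(s"MurmurHash3 collisions: $murmur")
  }
}
```

Results will depend heavily on the term distribution and on numFeatures, so this should be run against representative data before drawing conclusions.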
Switching to MurmurHash3 solves both problems. If there is agreement that this is a good change, I can prepare a PR.
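As a rough illustration of the proposed change, the term-to-index mapping could look something like the sketch below. This is not the current MLlib code or a finished patch: the object name, the indexOf signature and the seed value are hypothetical. The point is that a fixed seed combined with a specified algorithm gives the same index on every platform.

```scala
import scala.util.hashing.MurmurHash3

object MurmurIndexSketch {
  // Hypothetical fixed seed; a real change would need to pick and document one.
  val Seed: Int = 42

  // Map a possibly negative hash code into the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Feature index of a term, using MurmurHash3 from the Scala standard library.
  def indexOf(term: String, numFeatures: Int): Int =
    nonNegativeMod(MurmurHash3.stringHash(term, Seed), numFeatures)
}
```

Note that stringHash works on the string's characters rather than on encoded bytes, so its output should not be assumed to match MurmurHash3 implementations in other tools. If cross-tool compatibility matters, hashing the UTF-8 bytes with MurmurHash3.bytesHash may be worth considering instead.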
Note that changing the hash function would mean that models saved with a previous version would have to be re-trained. This introduces a problem that's orthogonal to breaking changes in APIs: breaking changes related to artifacts, e.g., a saved model, produced by a previous version. Is there a policy or best practice currently in effect about this? If not, perhaps we should come up with a few simple rules about how we communicate these in release notes, etc.