[SPARK-10574] HashingTF should use MurmurHash3 - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.0
Fix Version/s: 2.0.0
Component/s: MLlib
Labels:
- HashingTF
- hashing
- mllib

Target Version/s: 2.0.0

Description

HashingTF uses Scala's native hashing implementation (the ## method). There are two significant problems with this.

First, per the Scala documentation for hashCode, the implementation is platform specific. This means that feature vectors created on one platform may differ from vectors created on another. This can create significant problems when a model trained offline is used in another environment for online prediction. The problem is compounded because, after a hashing transform, features lose any human-readable meaning, so a discrepancy of this kind may be extremely difficult to track down.
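To make the first problem concrete, here is a minimal Python sketch of the hashing trick: each term's column index is its hash modulo the vector size, so any difference in the hash function changes the resulting vector. All function names here are hypothetical and none of them are Spark APIs. java_string_hash follows the documented java.lang.String.hashCode formula and is used only as a stand-in; alt_hash is an invented variant that plays the role of "the same call behaving differently on another platform".

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def _poly_hash(s: str, multiplier: int) -> int:
    """Polynomial hash over UTF-16 code units, wrapped to a signed 32-bit int."""
    data = s.encode("utf-16-be")
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (multiplier * h + unit) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def java_string_hash(s: str) -> int:
    # Stand-in for a "native" string hash: the 31-based polynomial
    # documented for java.lang.String.hashCode.
    return _poly_hash(s, 31)


def alt_hash(s: str) -> int:
    # Invented variant (multiplier 33), standing in for a platform
    # whose native hash behaves differently. Purely illustrative.
    return _poly_hash(s, 33)


def sparse_tf(terms: Iterable[str], num_features: int,
              hash_fn: Callable[[str], int]) -> Dict[int, int]:
    """Term-frequency vector as {column index: count}."""
    # Python's % yields a non-negative result for a positive modulus,
    # even when the hash value itself is negative.
    return dict(Counter(hash_fn(t) % num_features for t in terms))


doc = ["spark", "mllib", "hashing", "spark"]
print(sparse_tf(doc, 1 << 10, java_string_hash))
print(sparse_tf(doc, 1 << 10, alt_hash))
```

The two printed vectors describe the same document, yet a model trained on one would generally read the wrong columns if fed the other. Nothing in the vectors themselves reveals which hash function produced them.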

Second, the native Scala hashing function performs badly on longer strings, exhibiting 200-500% higher collision rates than, for example, MurmurHash3, which is also included in the standard Scala library and is the hashing choice of fast learners such as Vowpal Wabbit, scikit-learn, and others. If Spark users apply HashingTF only to very short, dictionary-like strings, the choice of hashing function will not be a big problem. But why have an implementation in MLlib with this limitation when a better one is readily available in the standard Scala library?
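One way to check a collision claim like this on your own data is sketched below, as an illustration only. murmur3_32 is a pure-Python version of the canonical 32-bit MurmurHash3 (x86 variant) over bytes; it is not guaranteed to be bit-identical to what the Scala library or Spark computes for strings. bucket_collisions is a hypothetical helper that counts how many distinct terms end up sharing a bucket with another term.

```python
from typing import Callable, Iterable

_MASK = 0xFFFFFFFF


def _mix_k(k: int) -> int:
    """Per-block scrambling step of MurmurHash3 (32-bit)."""
    k = (k * 0xCC9E2D51) & _MASK
    k = ((k << 15) | (k >> 17)) & _MASK  # rotate left by 15
    return (k * 0x1B873593) & _MASK


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Canonical MurmurHash3 x86 32-bit; returns an unsigned 32-bit int."""
    h = seed & _MASK
    n = len(data)
    body_len = n - (n % 4)
    # Body: consume the input four bytes at a time, little-endian.
    for i in range(0, body_len, 4):
        h ^= _mix_k(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & _MASK  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & _MASK
    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[body_len:]
    if tail:
        h ^= _mix_k(int.from_bytes(tail, "little"))
    # Finalization: mix in the length, then avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & _MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & _MASK
    h ^= h >> 16
    return h


def bucket_collisions(terms: Iterable[str], num_features: int,
                      hash_fn: Callable[[str], int]) -> int:
    """Number of distinct terms minus the number of buckets they occupy."""
    distinct = set(terms)
    buckets = {hash_fn(t) % num_features for t in distinct}
    return len(distinct) - len(buckets)


def murmur3_str(term: str) -> int:
    # Assumption: terms are hashed as UTF-8 bytes with seed 0.
    return murmur3_32(term.encode("utf-8"))


terms = [f"user_agent_string_number_{i}" for i in range(10_000)]
print(bucket_collisions(terms, 1 << 18, murmur3_str))
```

Passing a different hash_fn over the same terms and the same num_features gives a like-for-like comparison. The result depends heavily on the input strings, so representative data matters more than synthetic data like the list above.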

Switching to MurmurHash3 solves both problems. If there is agreement that this is a good change, I can prepare a PR.

Note that changing the hash function would mean that models saved with a previous version would have to be re-trained. This introduces a problem that's orthogonal to breaking changes in APIs: breaking changes related to artifacts, e.g., a saved model, produced by a previous version. Is there a policy or best practice currently in effect about this? If not, perhaps we should come up with a few simple rules about how we communicate these in release notes, etc.
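As one possible shape for such a rule, the sketch below records the hash algorithm alongside a saved artifact and refuses to load an artifact produced with a different one. This is not how Spark persists models. The field names, the "murmur3_32" and "native" identifiers, and both functions are hypothetical.

```python
import json

# Hypothetical identifier for the algorithm the current code uses.
CURRENT_HASH_ALGORITHM = "murmur3_32"


def save_metadata(num_features: int,
                  hash_algorithm: str = CURRENT_HASH_ALGORITHM) -> str:
    """Serialize the settings a hashed-feature model depends on."""
    return json.dumps({"numFeatures": num_features,
                       "hashAlgorithm": hash_algorithm})


def load_metadata(payload: str,
                  expected_algorithm: str = CURRENT_HASH_ALGORITHM) -> dict:
    """Load metadata, failing loudly if the hash algorithm does not match."""
    meta = json.loads(payload)
    # Assumption: artifacts written before the field existed used the
    # older "native" hashing, so a missing field is treated as such.
    found = meta.get("hashAlgorithm", "native")
    if found != expected_algorithm:
        raise ValueError(
            f"model was built with hashAlgorithm={found!r}, but this "
            f"version uses {expected_algorithm!r}; re-train or re-hash"
        )
    return meta


print(load_metadata(save_metadata(1 << 18)))
```

An explicit failure at load time turns a silent change in feature indices into an error that names its cause, which is also the kind of behavior a release note can describe.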

Issue Links

blocks

SPARK-14735 PySpark HashingTF hashAlgorithm param + docs

Resolved

contains

SPARK-14735 PySpark HashingTF hashAlgorithm param + docs

Resolved

is duplicated by

SPARK-13968 Use MurmurHash3 for hashing String features

Closed

relates to

SPARK-14899 Remove spark.ml HashingTF hashingAlg option

Resolved

links to

[Github] Pull Request #12498 (yanboliang)

People

Assignee: Yanbo Liang

Reporter: Simeon Simeonov

Shepherd: Joseph K. Bradley

Votes: 1

Watchers: 9

Dates

Created: 12/Sep/15 05:22

Updated: 25/Apr/16 19:15

Resolved: 25/Apr/16 19:08