Details
- Bug
- Status: Resolved
- Major
- Resolution: Not A Bug
- 2.3.1
- None
- None
Description
The current implementation of FeatureHasher.transform determines the vector index by taking a simple modulo of the hashed value, and it is suggested to set the numFeatures parameter to a large integer.
We found several issues with the current implementation:
- The feature name cannot be recovered from its index after the FeatureHasher transform, for example when reading feature importances from a decision tree trained on FeatureHasher output.
- When indices collide, which is likely when 'numFeatures' is relatively small, the value at that index is replaced by a new value (the sum of the old and the new value).
- To avoid collisions, 'numFeatures' has to be set to a large number, but the resulting highly sparse vectors increase the computational cost of model training.
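
The behaviour described above can be reproduced outside Spark with a minimal pure-Python sketch. It is an illustration only, not Spark's code: zlib.crc32 is a stand-in hash chosen for convenience, only numeric features are handled, and the helper names hash_index, hash_features and index_to_names are hypothetical. The last helper shows one possible workaround for the first issue, a side table built from the known feature names.

```python
import zlib
from collections import defaultdict


def hash_index(name: str, num_features: int) -> int:
    """Map a feature name to a vector index: hash value modulo num_features."""
    # Assumption: crc32 is only a stand-in for whatever hash the library uses.
    return zlib.crc32(name.encode("utf-8")) % num_features


def hash_features(row: dict, num_features: int) -> dict:
    """Return a sparse vector {index: value}; colliding features are summed."""
    vec = defaultdict(float)
    for name, value in row.items():
        vec[hash_index(name, num_features)] += value
    return dict(vec)


def index_to_names(names, num_features: int) -> dict:
    """Side table from each index back to the feature name(s) hashed to it."""
    table = defaultdict(list)
    for name in names:
        table[hash_index(name, num_features)].append(name)
    return dict(table)


row = {"age": 31.0, "height": 1.8, "weight": 75.0}

# Three features in two slots: at least two of them must share an index,
# so their values are summed and can no longer be told apart.
small = hash_features(row, num_features=2)

# The side table lists every name behind an index, including collisions.
lookup = index_to_names(row.keys(), num_features=2)
```

With num_features=2, three features cannot fit without a collision, so small has at most two entries while the total of its values still equals the total of the inputs. The lookup table makes collisions visible, but it has to be built and stored separately because the hashed vector itself carries no names.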