Spark / SPARK-25412

FeatureHasher would change the value of output feature

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      In the current implementation of FeatureHasher.transform, a simple modulo on the hashed value is used to determine the vector index, so it is suggested to use a large integer value for the numFeatures parameter.
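To make the index computation concrete, here is a minimal Python sketch of the hash-then-modulo scheme described above. It is not Spark's code: MD5 is used only as a deterministic, dependency-free stand-in for the hash function, and the names `hashed_index` and `hash_features` are hypothetical.

```python
import hashlib


def hashed_index(feature_name: str, num_features: int) -> int:
    """Map a feature name to a vector index with a plain modulo.

    Assumption: MD5 stands in for the real hash function so that the
    sketch is deterministic; it is not what FeatureHasher uses.
    """
    digest = hashlib.md5(feature_name.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:4], "big")  # unsigned 32-bit value
    return hash_value % num_features


def hash_features(row: dict, num_features: int) -> dict:
    """Return a sparse vector as {index: value}.

    Values whose names land on the same index are summed.
    """
    vector = {}
    for name, value in row.items():
        idx = hashed_index(name, num_features)
        vector[idx] = vector.get(idx, 0.0) + float(value)
    return vector


row = {"age": 31, "income": 5200, "clicks": 7}
vector = hash_features(row, num_features=16)
```

Every index falls in the range 0 to num_features - 1, and the total of the stored values equals the total of the inputs even when two names share an index.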

      We found several issues with the current implementation:

      1. The feature name cannot be recovered from its index after the FeatureHasher transform, for example when reading feature importances from a decision tree trained on FeatureHasher output.
      2. When an index collision occurs, which is quite likely when numFeatures is relatively small, the value at that index is replaced by a new value (the sum of the current and old values).
      3. To avoid collisions, numFeatures has to be set to a large number, but the resulting highly sparse vector increases the computational complexity of model training.
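The three points above can be illustrated with a small, self-contained Python sketch. It assumes a uniformly distributed hash (MD5 again serves only as a stand-in, not FeatureHasher's actual hash), and `index_to_names` and `collision_probability` are hypothetical helpers, not part of any Spark API. The side table shows one way to keep an index-to-name mapping outside the transformer; the probability function shows how quickly collisions become likely as numFeatures shrinks.

```python
import hashlib
from collections import defaultdict


def hashed_index(feature_name, num_features):
    # Assumption: MD5 is a deterministic stand-in, not FeatureHasher's hash.
    digest = hashlib.md5(feature_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_features


def index_to_names(feature_names, num_features):
    """Side table: index -> every input name that hashes to that index."""
    table = defaultdict(list)
    for name in feature_names:
        table[hashed_index(name, num_features)].append(name)
    return dict(table)


def collision_probability(num_distinct, num_features):
    """P(at least two of num_distinct names share an index).

    Assumes indices are drawn uniformly and independently.
    """
    p_none = 1.0
    for i in range(num_distinct):
        p_none *= max(0.0, 1.0 - i / num_features)
    return 1.0 - p_none


names = ["age", "income", "city", "device", "browser"]

# 5 names into 4 slots: at least one slot must hold two or more names.
small = index_to_names(names, 4)
ambiguous = {idx: ns for idx, ns in small.items() if len(ns) > 1}

# Probability of any collision for the same 5 names at two vector sizes.
p_small = collision_probability(len(names), 4)        # 1.0 by pigeonhole
p_large = collision_probability(len(names), 1 << 18)
```

An index listed in `ambiguous` cannot be attributed to a single feature, which is the situation described in point 1 when interpreting feature importances. Comparing `p_small` with `p_large` shows the trade-off in point 3: a larger vector makes collisions rare at the cost of a sparser, wider input.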

          People

            Assignee: Unassigned
            Reporter: VinceXie Vincent
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: