Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-23469

HashingTF should use corrected MurmurHash3 implementation

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.4.0
    • 3.0.0
    • ML
    • None
    • Hide
      In Spark 3.0, the HashingTF Transformer uses a corrected implementation of the murmur3 hash function to hash elements to vectors. HashingTF fit with Spark 3.0 will map elements to different positions in vectors than in Spark 2. However, HashingTF created with Spark 2.x and loaded with Spark 3.0 will still use the previous hash function and will not change behavior.
      Show
      In Spark 3.0, the HashingTF Transformer uses a corrected implementation of the murmur3 hash function to hash elements to vectors. HashingTF fit with Spark 3.0 will map elements to different positions in vectors than in Spark 2. However, HashingTF created with Spark 2.x and loaded with Spark 3.0 will still use the previous hash function and will not change behavior.

    Description

      SPARK-23381 added a corrected MurmurHash3 implementation but left the old implementation alone. In Spark 2.3 and earlier, HashingTF will use the old implementation. (We should not backport a fix for HashingTF since it would be a major change of behavior.) But we should correct HashingTF in Spark 2.4; this JIRA is for tracking this fix.

      • Update HashingTF to use new implementation of MurmurHash3
      • Ensure backwards compatibility for ML persistence by having HashingTF use the old MurmurHash3 when a model from Spark 2.3 or earlier is loaded. We can add a Param to allow this.

      Also, HashingTF still calls into the old spark.mllib.feature.HashingTF, so I recommend we first migrate the code to spark.ml: SPARK-21748. We can leave spark.mllib alone and just fix MurmurHash3 in spark.ml.

      Attachments

        Issue Links

          Activity

            People

              huaxingao Huaxin Gao
              josephkb Joseph K. Bradley
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: