SPARK-23469: HashingTF should use corrected MurmurHash3 implementation


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: ML
    • Labels: None
    • In Spark 3.0, the HashingTF Transformer uses a corrected implementation of the murmur3 hash function to hash elements to vector indices. A HashingTF created with Spark 3.0 maps elements to different vector positions than one created with Spark 2.x. However, a HashingTF created with Spark 2.x and loaded with Spark 3.0 still uses the previous hash function, so its behavior does not change.
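
To illustrate why changing the hash function moves elements to different vector positions, here is a minimal sketch of the hashing trick. It is not Spark's code: `zlib.crc32` and `zlib.adler32` are only stand-ins for the old and corrected murmur3 variants, and `term_index` and `hashing_tf` are hypothetical helper names.

```python
import zlib
from collections import Counter


def term_index(term, num_features, hash_fn):
    """Map a term to a vector position: hash of its bytes modulo the vector size."""
    return hash_fn(term.encode("utf-8")) % num_features


def hashing_tf(terms, num_features, hash_fn):
    """Return a sparse term-frequency vector as {index: count}."""
    return dict(Counter(term_index(t, num_features, hash_fn) for t in terms))


doc = ["spark", "ml", "spark", "hashing"]
num_features = 1 << 18

# Stand-in hashes (assumption): crc32 plays the "old" hash, adler32 the "corrected" one.
old_vec = hashing_tf(doc, num_features, zlib.crc32)
new_vec = hashing_tf(doc, num_features, zlib.adler32)

print("old hash:", old_vec)
print("new hash:", new_vec)
```

The counts are the same under both hashes, but they generally land at different indices, so a downstream model trained on vectors from one hash cannot be expected to work on vectors from the other.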

    Description

      SPARK-23381 added a corrected MurmurHash3 implementation but left the old implementation alone. In Spark 2.3 and earlier, HashingTF will use the old implementation. (We should not backport a fix for HashingTF since it would be a major change of behavior.) But we should correct HashingTF in Spark 2.4; this JIRA is for tracking this fix.

      • Update HashingTF to use new implementation of MurmurHash3
      • Ensure backwards compatibility for ML persistence by having HashingTF use the old MurmurHash3 when a model from Spark 2.3 or earlier is loaded. We can add a Param to allow this.

      Also, HashingTF still calls into the old spark.mllib.feature.HashingTF, so I recommend we first migrate the code to spark.ml: SPARK-21748. We can leave spark.mllib alone and just fix MurmurHash3 in spark.ml.
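
The two bullet points above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Spark patch: `legacy_hash` and `corrected_hash` are stand-ins built on `zlib` rather than real murmur3 code, and `HashingTFSketch`, `save`, `load`, and the `sparkVersion` metadata key are hypothetical names. The only point is that the loader chooses the hash function from the version recorded at save time.

```python
import json
import zlib


def legacy_hash(data: bytes) -> int:
    # Stand-in for the old MurmurHash3 variant (assumption, NOT the real algorithm).
    return zlib.crc32(data)


def corrected_hash(data: bytes) -> int:
    # Stand-in for the corrected MurmurHash3 (assumption, NOT the real algorithm).
    return zlib.adler32(data)


def parse_major_minor(version: str):
    """Turn a version string such as '2.3.0' into the tuple (2, 3)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


class HashingTFSketch:
    def __init__(self, num_features, created_with="3.0.0"):
        self.num_features = num_features
        self.created_with = created_with  # version that created this instance

    def _hash_fn(self):
        # Instances created before 3.0 keep the old hash; newer ones use the fix.
        if parse_major_minor(self.created_with) < (3, 0):
            return legacy_hash
        return corrected_hash

    def index_of(self, term):
        return self._hash_fn()(term.encode("utf-8")) % self.num_features

    def save(self):
        # Persist the creating version alongside the parameters.
        return json.dumps({"numFeatures": self.num_features,
                           "sparkVersion": self.created_with})

    @classmethod
    def load(cls, payload):
        meta = json.loads(payload)
        return cls(meta["numFeatures"], created_with=meta["sparkVersion"])


# An instance persisted by an older release keeps its old index mapping after loading.
old_model = HashingTFSketch.load('{"numFeatures": 1024, "sparkVersion": "2.3.0"}')
new_model = HashingTFSketch(1024)
print(old_model.index_of("spark"), new_model.index_of("spark"))
```

Recording the version in the saved metadata means callers do not have to set anything by hand; an explicit Param, as suggested in the description, would expose the same choice to users.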



          People

            Assignee: Huaxin Gao (huaxingao)
            Reporter: Joseph K. Bradley (josephkb)
            Votes: 0
            Watchers: 7

