SPARK-27099

Expose xxHash64 as a flexible 64-bit column hash like `hash`


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.3, 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      I’m working on something that requires deterministic randomness, i.e. a row gets the same “random” value no matter the order of the DataFrame. A seeded hash seems to be the perfect way to do this, but the existing hashes have various limitations:

      • hash: 32-bit output (only about 4 billion possible values, which results in a lot of collisions for many tables: the birthday paradox implies a >50% chance of at least one collision for tables larger than about 77,000 rows, and roughly 1.6 billion colliding rows in a table of about 4 billion (2^32) rows)
      • sha1/sha2/md5: single binary column input, string output
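
The collision figures quoted for the 32-bit `hash` can be checked with a short calculation. The sketch below uses the standard birthday-bound approximation and assumes an ideal, uniformly distributed hash; the function names are illustrative and are not part of Spark.

```python
import math


def rows_for_collision_probability(bits: int, p: float = 0.5) -> float:
    """Approximate row count at which P(at least one collision) reaches p.

    Uses the birthday approximation n ~ sqrt(2 * N * ln(1 / (1 - p)))
    for a hash with N = 2**bits equally likely outputs.
    """
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


def expected_colliding_rows(bits: int, rows: int) -> float:
    """Expected number of rows whose hash repeats the hash of an earlier row.

    Computed as rows minus the expected number of distinct hash values,
    N * (1 - (1 - 1/N) ** rows), evaluated with log1p/expm1 for stability.
    """
    space = 2.0 ** bits
    distinct = -space * math.expm1(rows * math.log1p(-1.0 / space))
    return rows - distinct


if __name__ == "__main__":
    for bits in (32, 64):
        threshold = rows_for_collision_probability(bits)
        repeats = expected_colliding_rows(bits, 2 ** 32)
        print(f"{bits}-bit: 50% collision chance near {threshold:,.0f} rows; "
              f"expected repeats in 2^32 rows: {repeats:,.2f}")
```

Under the same assumptions, a 64-bit hash pushes the 50% threshold into the billions of rows, and the expected number of repeats in a table of 2^32 rows drops below one.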

      It seems there’s already support for a 64-bit hash function that can work with an arbitrary number of arbitrary-typed columns (XxHash64), which could be exposed as xxHash64 or xxhash64 (or similar).
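
To illustrate the "deterministic randomness" use case, here is a minimal sketch of a seeded 64-bit hash over several columns of arbitrary type. It is plain Python and uses BLAKE2b from the standard library as a stand-in, so its outputs will not match Spark's XxHash64; the helper names `row_hash64` and `row_uniform`, the `repr`-based encoding, and the default seed are all assumptions made for the example.

```python
import hashlib


def row_hash64(values, seed=42):
    """Seeded 64-bit hash of a tuple of column values.

    Stand-in only: BLAKE2b truncated to 8 bytes, NOT xxHash64.
    The seed must fit in an unsigned 64-bit integer.
    """
    h = hashlib.blake2b(digest_size=8, key=seed.to_bytes(8, "little"))
    for v in values:
        data = repr(v).encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return int.from_bytes(h.digest(), "little")


def row_uniform(values, seed=42):
    """Map the hash to a float in [0, 1) using its top 53 bits."""
    return (row_hash64(values, seed) >> 11) / 2.0 ** 53


if __name__ == "__main__":
    rows = [(1, "alice", 3.5), (2, "bob", None), (3, "carol", 0.0)]
    forward = {r: row_uniform(r) for r in rows}
    backward = {r: row_uniform(r) for r in reversed(rows)}
    # Each row's value depends only on its contents and the seed,
    # not on the order in which rows are visited.
    assert forward == backward
    for r, u in forward.items():
        print(r, u)
```

Because the value is a pure function of the row contents and the seed, reordering or repartitioning the data does not change it, and changing the seed gives an independent-looking assignment.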

          People

            Assignee: huonw Huon Wilson
            Reporter: huonw Huon Wilson