SPARK-23693

SQL function uuid()

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.2.1, 2.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Add a function uuid() to org.apache.spark.sql.functions that returns a universally unique identifier (UUID).

      Sometimes it is necessary to uniquely identify each row in a DataFrame.

      Currently, the following approaches are available:

      • the monotonically_increasing_id() function
      • the row_number() function over some window
      • converting the DataFrame to an RDD and calling zipWithIndex()

      None of these approaches works when the DataFrame is appended to another DataFrame with a union. Each DataFrame's IDs are generated independently, so collisions may occur: two rows in different DataFrames may receive the same ID. Regenerating IDs on the resulting DataFrame is not an option, because data in other systems may already refer to the old IDs.
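The collision can be sketched without Spark. The example below is a minimal illustration on plain Scala collections, not Spark code: `Record`, `withIndexIds` and `withUuidIds` are hypothetical names introduced here, and `java.util.UUID` stands in for whatever generator an actual implementation would use.

```scala
import java.util.UUID

object IdCollisionSketch {
  // Hypothetical record type, used only for this illustration.
  final case class Record(payload: String)

  // Position-based IDs, in the spirit of zipWithIndex():
  // numbering restarts at 0 for every dataset.
  def withIndexIds(rows: Seq[Record]): Seq[(Long, Record)] =
    rows.zipWithIndex.map { case (row, i) => (i.toLong, row) }

  // Random UUIDs rendered as strings; the datasets need no coordination.
  def withUuidIds(rows: Seq[Record]): Seq[(String, Record)] =
    rows.map(row => (UUID.randomUUID().toString, row))

  def main(args: Array[String]): Unit = {
    val first = Seq(Record("a"), Record("b"))
    val second = Seq(Record("c"), Record("d"))

    // Appending two independently numbered datasets, as a union would.
    val indexed = withIndexIds(first) ++ withIndexIds(second)
    val uuids = withUuidIds(first) ++ withUuidIds(second)

    println(s"distinct index IDs: ${indexed.map(_._1).distinct.size} of ${indexed.size}")
    println(s"distinct UUIDs: ${uuids.map(_._1).distinct.size} of ${uuids.size}")
  }
}
```

The index-based IDs repeat because both datasets start counting at 0, while random UUIDs are drawn from a 128-bit space, so a repeat is possible in principle but vanishingly unlikely.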

      The proposed solution is to add a new function:

      def uuid(): Column
      

      that returns the String representation of a UUID.

      A UUID is a 128-bit number, equivalent to two Long values. Scala and Java have no primitive 128-bit integer type. In addition, some storage systems do not support 128-bit numbers (Parquet's largest integer physical type is INT96). For these reasons, the uuid() function returns a String.
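The relationship between the two representations can be sketched with the JDK's `java.util.UUID` class. This is only an illustration of the 128-bit layout and its String form, not the proposed Spark function; `toLongs` and `fromLongs` are hypothetical helper names.

```scala
import java.util.UUID

object UuidRepresentationSketch {
  // Split a UUID into the two 64-bit halves that make up its 128 bits.
  def toLongs(id: UUID): (Long, Long) =
    (id.getMostSignificantBits, id.getLeastSignificantBits)

  // Rebuild the UUID from its two halves.
  def fromLongs(high: Long, low: Long): UUID = new UUID(high, low)

  def main(args: Array[String]): Unit = {
    val id = UUID.randomUUID()
    val (high, low) = toLongs(id)

    // Canonical text form: 32 hexadecimal digits plus 4 hyphens, 36 characters.
    val text = id.toString
    println(s"$text -> ($high, $low)")

    // Both representations round-trip to the same value.
    assert(fromLongs(high, low) == id)
    assert(UUID.fromString(text) == id)
  }
}
```

Storing the String costs 36 characters per row instead of 16 bytes, but every storage format that supports strings can hold it in a single column.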

      I already have a simple implementation based on java-uuid-generator. I can share it as a PR.

          People

            Assignee: Unassigned
            Reporter: Arseniy Tashoyan (tashoyan)
            Votes: 0
            Watchers: 4
