[SPARK-23693] SQL function uuid() - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 2.2.1, 2.3.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

Add a function uuid() to org.apache.spark.sql.functions that returns a universally unique identifier (UUID).

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:

monotonically_increasing_id() function
row_number() function over some window
converting the DataFrame to an RDD and calling zipWithIndex()

None of these approaches works when this DataFrame is appended to another DataFrame (union). Collisions may occur: two rows in different DataFrames may receive the same ID. Regenerating IDs on the resulting DataFrame is not an option, because data in other systems may already refer to the old IDs.
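The collision can be illustrated without Spark. The sketch below assumes the ID layout that the Spark documentation describes for monotonically_increasing_id() (partition index in the upper 31 bits, record number within the partition in the lower 33 bits). The class and method names are hypothetical and only simulate that scheme in plain Java; the partition sizes are made-up sample data.

```java
import java.util.HashSet;
import java.util.Set;

public class IdCollisionSketch {

    // Simulates the documented layout of monotonically_increasing_id():
    // partition index in the upper bits, row position within the partition in the lower 33 bits.
    static long simulatedId(int partitionIndex, long rowInPartition) {
        return ((long) partitionIndex << 33) | rowInPartition;
    }

    // Generates the IDs of a hypothetical dataset with the given number of rows per partition.
    static Set<Long> idsFor(int[] rowsPerPartition) {
        Set<Long> ids = new HashSet<>();
        for (int p = 0; p < rowsPerPartition.length; p++) {
            for (long r = 0; r < rowsPerPartition[p]; r++) {
                ids.add(simulatedId(p, r));
            }
        }
        return ids;
    }

    // Counts IDs present in both datasets, i.e. the collisions a union would produce.
    static int collisions(Set<Long> first, Set<Long> second) {
        Set<Long> shared = new HashSet<>(first);
        shared.retainAll(second);
        return shared.size();
    }

    public static void main(String[] args) {
        Set<Long> first = idsFor(new int[] {3, 2});   // 5 rows in 2 partitions
        Set<Long> second = idsFor(new int[] {2, 4});  // 6 rows in 2 partitions
        System.out.println("colliding IDs: " + collisions(first, second));
    }
}
```

Because each dataset starts counting from zero in every partition, the IDs are unique only within one dataset; any two independently numbered datasets share their low-numbered IDs.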

The proposed solution is to add a new function:

def uuid(): Column

that returns the String representation of a UUID.

A UUID is a 128-bit value (two 64-bit long numbers). Neither Scala nor Java has a primitive 128-bit integer type. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). For this reason, the uuid() function returns a String.
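The relationship between the 128-bit value, its two long halves, and the String form can be sketched with the JDK's own java.util.UUID. This is not the implementation mentioned in this ticket (which is based on java-uuid-generator); java.util.UUID is swapped in here only to keep the example dependency-free, and the class and method names are hypothetical.

```java
import java.util.UUID;

public class UuidStringSketch {

    // Generates a random (version 4) UUID and returns its canonical 36-character text form.
    static String newUuidString() {
        return UUID.randomUUID().toString();
    }

    // Splits a UUID string into the two 64-bit halves that make up the 128-bit value.
    static long[] halves(String text) {
        UUID id = UUID.fromString(text);
        return new long[] {id.getMostSignificantBits(), id.getLeastSignificantBits()};
    }

    // Rebuilds the canonical string from the two halves, showing that the round trip is lossless.
    static String fromHalves(long most, long least) {
        return new UUID(most, least).toString();
    }

    public static void main(String[] args) {
        String text = newUuidString();
        long[] parts = halves(text);
        System.out.println(text + " -> " + parts[0] + ", " + parts[1]);
        System.out.println("round trip ok: " + fromHalves(parts[0], parts[1]).equals(text));
    }
}
```

Storing the String form trades some space (36 characters instead of 16 bytes) for portability across storage systems that lack a 128-bit type.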

I already have a simple implementation based on java-uuid-generator and can share it as a PR.

Issue Links

links to

[Github] Pull Request #21055 (tashoyan)

People

Assignee: Unassigned

Reporter: Arseniy Tashoyan

Votes: 0

Watchers: 4

Dates

Created: 15/Mar/18 10:57

Updated: 17/Aug/21 13:54

Resolved: 08/Jan/19 02:42