[SPARK-23599] The UUID() expression is too non-deterministic - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.1, 2.4.0
Component/s: SQL
Labels: None
Target Version/s: 2.4.0

Description

The current Uuid() expression uses java.util.UUID.randomUUID for UUID generation. There are a couple of major problems with this:

1. It is non-deterministic across task retries. This breaks Spark's processing model and leads to bugs that are very hard to trace, such as non-deterministic shuffles, duplicate rows, and missing rows.
2. It uses a single SecureRandom for UUID generation. That instance sits behind a single JVM-wide lock, which can lead to lock contention and other performance problems.

We should move to something that is deterministic between retries. This can be done by using seeded PRNGs for which we set the seed during planning. It is important here to use a PRNG that provides enough entropy for creating a proper UUID.
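
To illustrate the idea, here is a minimal Java sketch of a version 4 UUID generator driven by a seeded PRNG. It is not the code from the linked pull requests: the class name SeededUuidGenerator, the partitionIndex parameter, and the choice of java.util.SplittableRandom are assumptions made for this example. SplittableRandom carries only 64 bits of seed state, so a real implementation would likely want a generator with a larger state to meet the entropy requirement mentioned above.

```java
import java.util.SplittableRandom;
import java.util.UUID;

/** Hypothetical helper (not Spark's API): version 4 UUIDs from a seeded PRNG. */
final class SeededUuidGenerator {
    // Stand-in PRNG for illustration; it is not cryptographically secure
    // and holds only 64 bits of seed state.
    private final SplittableRandom random;

    /**
     * @param seed           value fixed once, e.g. during query planning (assumption)
     * @param partitionIndex mixed into the seed so partitions produce different streams
     */
    SeededUuidGenerator(long seed, int partitionIndex) {
        this.random = new SplittableRandom(seed + partitionIndex);
    }

    /** Returns the next UUID in this generator's deterministic sequence. */
    UUID next() {
        long msb = random.nextLong();
        long lsb = random.nextLong();
        // Force the version field (bits 12-15 of the high word) to 4.
        msb = (msb & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L;
        // Force the variant field (top two bits of the low word) to binary 10.
        lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(msb, lsb);
    }
}
```

Two generators built with the same seed and partition index return the same sequence, so a retried task would regenerate the same UUIDs. Each generator instance also owns its PRNG, so no lock is shared across the JVM.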

Issue Links

relates to

SPARK-23794 UUID() should be stateful (Resolved)

links to

[Github] Pull Request #20817 (viirya)

[Github] Pull Request #20861 (viirya)

[Github] Pull Request #20903 (viirya)

People

Assignee: L. C. Hsieh

Reporter: Herman van Hövell

Votes: 0

Watchers: 7

Dates

Created: 05/Mar/18 13:27

Updated: 30/Dec/21 18:50

Resolved: 22/Mar/18 18:58