[SPARK-43493] Add a max distance argument to the levenshtein() function - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.5.0
Component/s: SQL
Labels: None

Description

Currently, Spark's levenshtein(str1, str2) function can be very inefficient for long strings. Many other databases that support this kind of built-in function also accept a third argument specifying a maximum distance, beyond which it is okay to terminate the algorithm early.

For example, something like

levenshtein(str1, str2[, max_distance])

where the function stops computing the distance once the maximum value is reached.
See PostgreSQL for an example of a three-argument levenshtein.
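
To illustrate why a maximum distance helps, here is a minimal Python sketch of early termination. It is not the Spark implementation: the name bounded_levenshtein is hypothetical, and returning -1 when the bound is exceeded is a convention assumed for this sketch, since the description above does not say what the function should return in that case.

```python
def bounded_levenshtein(a: str, b: str, max_distance: int) -> int:
    """Edit distance between a and b, or -1 if it exceeds max_distance.

    Hypothetical helper for illustration; the -1 sentinel is an assumption.
    """
    if max_distance < 0:
        raise ValueError("max_distance must be non-negative")

    # The difference in lengths is a lower bound on the edit distance,
    # so this case can be rejected without filling in any table rows.
    if abs(len(a) - len(b)) > max_distance:
        return -1

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        # The smallest value in a row never decreases from one row to the
        # next, so once every cell exceeds the bound the result must too.
        if min(current) > max_distance:
            return -1
        previous = current

    return previous[-1] if previous[-1] <= max_distance else -1
```

With a small bound, long inputs that differ early are rejected after a few rows instead of filling the whole table. A production version would typically also restrict each row to a band of width 2 * max_distance + 1 around the diagonal, which this sketch omits for brevity.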

Issue Links

links to

[Github] Pull Request #41169 (panbingkun)

[Github] Pull Request #41724 (panbingkun)

People

Assignee: Pan Bingkun

Reporter: Max Gekk

Votes: 0

Watchers: 4

Dates

Created: 13/May/23 16:09

Updated: 25/Jun/23 11:56

Resolved: 23/May/23 17:17