[SPARK-24884] Implement regexp_extract_all - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.1
Fix Version/s: 3.1.0
Component/s: SQL
Labels: None

Description

I've recently hit many cases of regexp parsing where we need to match something that occurs an arbitrary number of times; for example, a text block that looks something like:

AAA:WORDS|
BBB:TEXT|
MSG:ASDF|
MSG:QWER|
...
MSG:ZXCV|

Here I need to pull out every value between "MSG:" and "|"; such values can occur anywhere from 1 to n times in each block. I cannot reliably use the existing regexp_extract method because the number of occurrences is arbitrary, and while I can write a UDF to handle this, it would be great if this were supported natively in Spark.

Perhaps we can implement something like regexp_extract_all as Presto and Pig have?
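
To make the requested behavior concrete, here is a minimal sketch in plain Python of what an "extract all" over the sample block might return. It is not Spark code and not the implementation from the linked pull requests; the names `TEXT`, `MSG_PATTERN`, and `extract_all` are hypothetical, and the sample omits the elided "..." lines.

```python
import re

# Sample block in the shape shown in the description (elided lines left out).
TEXT = "AAA:WORDS|\nBBB:TEXT|\nMSG:ASDF|\nMSG:QWER|\nMSG:ZXCV|\n"

# Capture everything between "MSG:" and the next "|".
MSG_PATTERN = re.compile(r"MSG:([^|]*)\|")


def extract_all(text, pattern, group=1):
    """Return the chosen capture group for every non-overlapping match."""
    return [m.group(group) for m in pattern.finditer(text)]


messages = extract_all(TEXT, MSG_PATTERN)
print(messages)  # ['ASDF', 'QWER', 'ZXCV']
```

Unlike a single-match extract, the result is a list whose length follows the number of "MSG:" entries, so the caller does not need to know that count in advance. An input with no matches yields an empty list.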

Issue Links

links to

[Github] Pull Request #21985 (xueyumusic)

GitHub Pull Request #27507

People

Assignee: Jiaan Geng

Reporter: Nick Nicolini

Votes: 0

Watchers: 7

Dates

Created: 22/Jul/18 20:55

Updated: 03/Aug/20 06:04

Resolved: 03/Aug/20 06:04