Spark / SPARK-24884

Implement regexp_extract_all


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels: None

      Description

      I've recently hit many regexp parsing cases where the pattern we need to match occurs an arbitrary number of times; for example, a text block that looks something like:

      AAA:WORDS|
      BBB:TEXT|
      MSG:ASDF|
      MSG:QWER|
      ...
      MSG:ZXCV|

      I need to pull out all values between "MSG:" and "|", which can occur between 1 and n times in each instance. I cannot reliably use the existing regexp_extract function, since the number of occurrences is arbitrary. I could write a UDF to handle this, but it would be great if this were supported natively in Spark.

      Perhaps we can implement something like regexp_extract_all as Presto and Pig have?
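
To make the requested behavior concrete, here is a minimal sketch in plain Python, run on the sample block from the description. The helper name `regexp_extract_all`, its `group` parameter, and the pattern are assumptions made for illustration; this is not Spark's implementation, only the semantics being asked for (every match is returned, not just the first).

```python
import re


def regexp_extract_all(text, pattern, group=1):
    """Return the chosen capture group for every match of `pattern` in `text`.

    Hypothetical helper for illustration only; the name and signature are
    assumptions, not an existing Spark API.
    """
    return [m.group(group) for m in re.finditer(pattern, text)]


# Sample block from the description (the "..." placeholder line is omitted).
sample = "AAA:WORDS|\nBBB:TEXT|\nMSG:ASDF|\nMSG:QWER|\nMSG:ZXCV|\n"

# Capture everything between "MSG:" and the next "|".
messages = regexp_extract_all(sample, r"MSG:([^|]*)\|")
print(messages)  # ['ASDF', 'QWER', 'ZXCV']
```

The same function body could serve as the UDF workaround mentioned above; a native function would avoid the serialization overhead that a Python UDF typically incurs.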

       


            People

            • Assignee: beliefer jiaan.geng
            • Reporter: nnicolini Nick Nicolini
