Spark / SPARK-24884

Implement regexp_extract_all

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels: None

    Description

      I've recently hit many regexp-parsing cases where a pattern needs to match an arbitrary number of times; for example, a text block that looks something like:

      AAA:WORDS|
      BBB:TEXT|
      MSG:ASDF|
      MSG:QWER|
      ...
      MSG:ZXCV|

      Here I need to pull out every value between "MSG:" and "|"; such values can occur anywhere from 1 to n times in each block. I cannot reliably use the existing regexp_extract function, since it returns only a single match and the number of occurrences varies. I could write a UDF to handle this, but it would be great if Spark supported it natively.

      Perhaps we can implement something like regexp_extract_all as Presto and Pig have?
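
The requested behavior can be sketched in plain Python with the standard `re` module. This is only an illustration of the "return every match" semantics being asked for, not Spark's implementation; the names `SAMPLE`, `MSG_PATTERN`, and `extract_all` are hypothetical, and the sample text simply mirrors the block shown in the description.

```python
import re

# Hypothetical sample mirroring the text block in the description.
SAMPLE = "AAA:WORDS|\nBBB:TEXT|\nMSG:ASDF|\nMSG:QWER|\nMSG:ZXCV|\n"

# One capture group: everything between "MSG:" and the next "|".
MSG_PATTERN = re.compile(r"MSG:([^|]*)\|")


def extract_all(text, pattern=MSG_PATTERN, group=1):
    """Return the chosen capture group from every match, in order of appearance."""
    return [m.group(group) for m in pattern.finditer(text)]


print(extract_all(SAMPLE))  # ['ASDF', 'QWER', 'ZXCV']
```

A block with no "MSG:" entries yields an empty list, which is the case a single-match function cannot distinguish cleanly from a block with exactly one entry. Since this issue is marked fixed in 3.1.0, the native SQL function `regexp_extract_all(str, regexp[, idx])` should cover the same use case there without a UDF.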

       

          People

            Assignee: beliefer Jiaan Geng
            Reporter: nnicolini Nick Nicolini
            Votes: 0
            Watchers: 7
