Spark / SPARK-16324

regexp_extract should doc that it returns empty string when match fails

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      The documentation for regexp_extract isn't clear about how it behaves when the regex doesn't match the row. However, the Java documentation it refers to for further detail suggests that the return value should be null if the group wasn't matched at all, an empty string if the group actually matched an empty string, and an exception if the entire regex didn't match.

      This would be identical to how Python's own re module behaves when MatchObject.group() is called.
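
      For reference, a minimal sketch of the three cases using only Python's standard re module. The patterns and variable names are illustrative and not taken from the Spark code base. Note that in Python a failed search returns None rather than a match object, so the "exception" case shows up only when .group() is then called on that None.

```python
import re

# Case 1: the whole pattern fails to match, so re.search returns None
# and there is no match object to call .group() on.
no_match = re.search(r'(z)', 'abc')

# Case 2: the pattern matches, but the optional group takes no part in it.
unmatched_group = re.search(r'a(z)?', 'abc').group(1)

# Case 3: the pattern matches and the group captures an empty string.
empty_group = re.search(r'a(z*)', 'abc').group(1)

print(no_match, unmatched_group, repr(empty_group))
```

      Only the third case yields an empty string here; the first two yield None, which is the distinction the description argues is lost.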

      However, in practice regexp_extract() returns an empty string when the match fails. This seems to be a bug; if it was intended as a feature, it should have been documented as such, and it was probably not a good idea, since it can result in silent bugs.

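      import pyspark.sql.functions as F
      from pyspark.sql import SparkSession
      spark = SparkSession.builder.getOrCreate()  # already defined as `spark` in the pyspark shell
      df = spark.createDataFrame([['abc']], ['text'])
      # The pattern (z) does not match 'abc', yet the result is '' rather than null or an error.
      assert df.select(F.regexp_extract('text', r'(z)', 1)).first()[0] == ''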
      

            People

              Assignee: srowen Sean R. Owen
              Reporter: mmoroz Max Moroz
              Votes: 0
              Watchers: 3
