SPARK-39107

Silent change in regexp_replace's handling of empty strings

Details

    Description

      Hi, we just upgraded from 3.0.2 to 3.1.2 and noticed a silent behavior change that a) seems incorrect, and b) is undocumented in the migration guide:

      3.0.2
      scala> val df = spark.sql("SELECT '' AS col")
      df: org.apache.spark.sql.DataFrame = [col: string]
      
      scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show
      +---+--------+
      |col|replaced|
      +---+--------+
      |   | <empty>|
      +---+--------+
      
      3.1.2
      scala> val df = spark.sql("SELECT '' AS col")
      df: org.apache.spark.sql.DataFrame = [col: string]
      
      scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show
      +---+--------+
      |col|replaced|
      +---+--------+
      |   |        |
      +---+--------+
      

      Note that the regular expression ^$ should match the empty string, but in 3.1.2 it does not. For comparison, this is the behavior of Java's String.replaceAll:

      scala> "".replaceAll("^$", "<empty>");
      res1: String = <empty>
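
      The same check can be run without a Spark session. The sketch below uses java.util.regex directly to show what the JVM regex engine does with ^$ on an empty and a non-empty input. The object name EmptyStringRegexCheck and the helper replaceEmpty are hypothetical, introduced only for illustration; this is not Spark's implementation of regexp_replace.

```scala
import java.util.regex.Pattern

object EmptyStringRegexCheck {
  // Pattern from the report; it matches the empty string.
  val emptyPattern: Pattern = Pattern.compile("^$")

  // Hypothetical helper: replace every match of ^$ with the placeholder
  // using plain java.util.regex, with no Spark involved.
  def replaceEmpty(input: String): String =
    emptyPattern.matcher(input).replaceAll("<empty>")

  def main(args: Array[String]): Unit = {
    println(replaceEmpty(""))    // empty input matches and is replaced
    println(replaceEmpty("abc")) // no match, input is returned unchanged
  }
}
```

      If this prints <empty> for the empty input, the regex engine itself matches as expected, which suggests the difference in 3.1.2 comes from how regexp_replace handles empty input rather than from the pattern.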
      

          People

            LorenzoMartini94 Lorenzo Martini
            rshkv Willi Raschkowski