Spark / SPARK-20399

Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.0
    • Component/s: SQL
    • Labels: None

    Description

      A new SQL parser was introduced in Spark 2.0. It seems to bring an issue with regex pattern strings.

      The following code reproduces it:

      val data = Seq("\u0020\u0021\u0023", "abc")
      val df = data.toDF()
      
      // 1st usage: works in 1.6 but not in 2.x.
      // The parser parses and unescapes the pattern string.
      val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
      // 2nd usage: works in both 1.6 and 2.x.
      // Column.rlike takes the pattern as a literal, so it does not go through the parser.
      val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
      
      // In 2.x, we need to double the backslashes for the regex pattern to be parsed correctly.
      val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")
      

      Because the parser unescapes SQL strings, the first usage, which works in 1.6, no longer works in 2.x. To make it work there, we have to add extra backslashes.

      It is quite odd that we cannot use the same regex pattern string in both usages. I think we should not unescape the regex pattern string.
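      The effect of that extra unescaping step can be illustrated without Spark at all. In the sketch below, `unescapeOnce` is a hypothetical, simplified stand-in for the parser's string-literal unescaping (it is not Spark's actual implementation): it strips one level of backslashes, which is roughly what happens to the unrecognized `\x` escape in a SQL string. After one such pass, a valid `\x20` regex escape degrades to the plain text `x20` and the pattern stops matching.

      ```scala
      object RegexEscapeDemo {
        // Simplified stand-in for one round of SQL string unescaping:
        // drop the backslash in front of any escaped character.
        def unescapeOnce(s: String): String =
          s.replaceAll("""\\(.)""", "$1")

        def main(args: Array[String]): Unit = {
          val data = Seq("\u0020\u0021\u0023", "abc")

          // The pattern as the regex engine must see it: `^\x20[\x20-\x23]+$`.
          // It matches the first string (space, '!', '#').
          val goodPattern = "^\\x20[\\x20-\\x23]+$"
          println(data.filter(_.matches(goodPattern)))  // the " !#" string matches

          // After one round of unescaping, `\x20` becomes `x20`,
          // and the degraded pattern matches nothing in `data`.
          val degraded = unescapeOnce(goodPattern)
          println(degraded)                             // ^x20[x20-x23]+$
          println(data.filter(_.matches(degraded)))     // empty
        }
      }
      ```

      This is why the doubled backslashes in the third usage are needed in 2.x: they survive the parser's unescaping pass and reach the regex engine intact.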


          People

            Assignee: viirya L. C. Hsieh
            Reporter: viirya L. C. Hsieh
            Votes: 0
            Watchers: 3
