SPARK-18866: Codegen fails with cryptic error if regexp_replace() output column is not aliased


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.0.2, 2.1.0
    • Fix Version/s: 2.1.1, 2.2.0
    • Component/s: PySpark, SQL
    • Labels: None
    • Environment: Java 8, Python 3.5

    Description

      Here's a minimal repro:

      import pyspark
      from pyspark.sql import Column
      from pyspark.sql.functions import regexp_replace, lower, col
      
      
      def normalize_udf(column: Column) -> Column:
          # Collapse every run of whitespace in the column down to a single space.
          normalized_column = (
              regexp_replace(
                  column,
                  pattern=r'[\s]+',
                  replacement=' ',
              )
          )
          return normalized_column
      
      
      if __name__ == '__main__':
          spark = pyspark.sql.SparkSession.builder.getOrCreate()
          raw_df = spark.createDataFrame(
              [('          ',)],
              ['string'],
          )
          normalized_df = raw_df.select(normalize_udf('string'))
          normalized_df_prime = (
              normalized_df
              .groupBy(sorted(normalized_df.columns))
              .count())
          normalized_df_prime.show()
      

      When I run this I get:

      ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80, Column 130: Invalid escape sequence
      

      Followed by a huge barf of generated Java code, and then the output I expect. (So despite the scary error, the code actually works!)

      Can you spot the error in my code?

      It's simple: I just need to alias the output of normalize_udf() and all is forgiven:

      normalized_df = raw_df.select(normalize_udf('string').alias('string'))
      

      Of course, it's impossible to tell that from the current error output. So my first question is: Is there some way we can better communicate to the user what went wrong?
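
      For context, here is my best guess at what is going on (an assumption on my part, not a confirmed diagnosis): without an alias, the output column inherits an auto-generated name that embeds the raw regex, backslash and all, and that name seems to be what leaks into the generated Java source. Inspecting the column names makes the difference visible:

      # Sketch only: the exact auto-generated name may vary by Spark version,
      # but it contains the raw '\s' from the pattern.
      print(raw_df.select(normalize_udf('string')).columns)
      # e.g. ['regexp_replace(string, [\s]+,  )']  <- the backslash here is what
      # trips up the generated Java code
      print(raw_df.select(normalize_udf('string').alias('string')).columns)
      # ['string']  <- a plain alias keeps the backslash out of codegen entirely
      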

      Another interesting thing I noticed is that if I try this:

      normalized_df = raw_df.select(lower('string'))
      

      I immediately get a clean error saying:

      py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.lower. Trace:
      py4j.Py4JException: Method lower([class java.lang.String]) does not exist
      

      I can fix this by building a column object:

      normalized_df = raw_df.select(lower(col('string')))
      

      So that raises a second problem/question: Why does lower() require that I build a Column object, whereas regexp_replace() does not? The inconsistency adds to the confusion here.
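
      For what it's worth, my reading of python/pyspark/sql/functions.py (which may be off) is that regexp_replace() explicitly converts its first argument to a Column, while lower() is one of the auto-generated wrappers that forwards whatever it is given straight to the JVM, where only a lower(Column) overload exists. A defensive pattern that sidesteps the inconsistency is to always pass a Column rather than a bare name string:

      from pyspark.sql.functions import col, lower, regexp_replace
      
      # Passing Column objects works the same way for both functions; the aliases
      # also keep the regex out of the auto-generated column names.
      consistent_df = raw_df.select(
          lower(col('string')).alias('lowered'),
          regexp_replace(col('string'), r'[\s]+', ' ').alias('normalized'),
      )
      consistent_df.show()
      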

    People

      • Assignee: Unassigned
      • Reporter: Nicholas Chammas (nchammas)
      • Votes: 0
      • Watchers: 3
