Description
Here's a minimal repro:
import pyspark
from pyspark.sql import Column
from pyspark.sql.functions import regexp_replace, lower, col


def normalize_udf(column: Column) -> Column:
    normalized_column = (
        regexp_replace(
            column,
            pattern='[\s]+',
            replacement=' ',
        )
    )
    return normalized_column


if __name__ == '__main__':
    spark = pyspark.sql.SparkSession.builder.getOrCreate()
    raw_df = spark.createDataFrame(
        [(' ',)],
        ['string'],
    )
    normalized_df = raw_df.select(normalize_udf('string'))
    normalized_df_prime = (
        normalized_df
        .groupBy(sorted(normalized_df.columns))
        .count())
    normalized_df_prime.show()
When I run this I get:
ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80, Column 130: Invalid escape sequence
Followed by a huge barf of generated Java code, and then the output I expect. (So despite the scary error, the code actually works!)
Can you spot the error in my code?
It's simple: I just need to alias the output of normalize_udf() and all is forgiven:
normalized_df = raw_df.select(normalize_udf('string').alias('string'))
Of course, it's impossible to tell that from the current error output. So my first question is: Is there some way we can better communicate to the user what went wrong?
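One plausible reading of the "Invalid escape sequence" message, consistent with the title of the linked SPARK-18952, is that the regex text ends up inside a Java string literal in the generated code without its backslash being escaped. The plain-Python sketch below only illustrates that idea. `java_string_literal` is a hypothetical helper, not Spark's actual codegen, and it handles only backslashes and double quotes.

```python
def java_string_literal(text: str) -> str:
    """Hypothetical helper: render text as a Java string literal.

    Only backslashes and double quotes are escaped here; a real
    implementation would also need to handle newlines, control
    characters, and so on.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


pattern = r"[\s]+"  # the regex from the repro, as a raw string

# Naive embedding: the backslash is copied verbatim into the Java source,
# so the Java compiler would see the escape sequence \s inside a literal.
naive = f'"{pattern}"'

# Escaped embedding: the backslash is doubled, so the Java literal
# evaluates back to the original regex text.
safe = java_string_literal(pattern)

print("naive:", naive)
print("safe: ", safe)
```

If this guess is right, the naive form is the kind of text that would trigger the compile error, while the escaped form would compile and still carry the same regex.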
Another interesting thing I noticed is that if I try this:
normalized_df = raw_df.select(lower('string'))
I immediately get a clean error saying:
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.lower. Trace: py4j.Py4JException: Method lower([class java.lang.String]) does not exist
I can fix this by building a column object:
normalized_df = raw_df.select(lower(col('string')))
So that raises a second problem/question: Why does lower() require that I build a Column object, whereas regexp_replace() does not? The inconsistency adds to the confusion here.
Issue Links
- is duplicated by SPARK-18952: regex strings not properly escaped in codegen for aggregations (Resolved)