SPARK-18866: Codegen fails with cryptic error if regexp_replace() output column is not aliased


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.0.2, 2.1.0
    • Fix Version/s: 2.1.1, 2.2.0
    • Component/s: PySpark, SQL
    • Labels: None
    • Environment: Java 8, Python 3.5

    Description

      Here's a minimal repro:

      import pyspark
      from pyspark.sql import Column
      from pyspark.sql.functions import regexp_replace, lower, col
      
      
      def normalize_udf(column: Column) -> Column:
          # Collapse every run of whitespace in the column down to a single space.
          normalized_column = (
              regexp_replace(
                  column,
                  pattern=r'[\s]+',
                  replacement=' ',
              )
          )
          return normalized_column
      
      
      if __name__ == '__main__':
          spark = pyspark.sql.SparkSession.builder.getOrCreate()
          raw_df = spark.createDataFrame(
              [('          ',)],
              ['string'],
          )
          normalized_df = raw_df.select(normalize_udf('string'))
          normalized_df_prime = (
              normalized_df
              .groupBy(sorted(normalized_df.columns))
              .count())
          normalized_df_prime.show()
      

      When I run this I get:

      ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80, Column 130: Invalid escape sequence
      

      Followed by a huge barf of generated Java code, and then the output I expect. (So despite the scary error, the code actually works!)

      Can you spot the error in my code?

      It's simple: I just need to alias the output of normalize_udf() and all is forgiven:

      normalized_df = raw_df.select(normalize_udf('string').alias('string'))
      

      Of course, it's impossible to tell that from the current error output. So my first question is: Is there some way we can better communicate to the user what went wrong?
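
      For context, here is my best guess at what is going on (an assumption on my part, not a confirmed diagnosis): without an alias, the output column inherits an auto-generated name that embeds the raw regex, backslash and all, and that name seems to be what leaks into the generated Java source. Inspecting the column names makes the difference visible:

      # Sketch only: the exact auto-generated name may vary by Spark version,
      # but it contains the raw '\s' from the pattern.
      print(raw_df.select(normalize_udf('string')).columns)
      # e.g. ['regexp_replace(string, [\s]+,  )']  <- the backslash here is what
      # trips up the generated Java code
      print(raw_df.select(normalize_udf('string').alias('string')).columns)
      # ['string']  <- a plain alias keeps the backslash out of codegen entirely
      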

      Another interesting thing I noticed is that if I try this:

      normalized_df = raw_df.select(lower('string'))
      

      I immediately get a clean error saying:

      py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.lower. Trace:
      py4j.Py4JException: Method lower([class java.lang.String]) does not exist
      

      I can fix this by building a column object:

      normalized_df = raw_df.select(lower(col('string')))
      

      So that raises a second problem/question: Why does lower() require that I build a Column object, whereas regexp_replace() does not? The inconsistency adds to the confusion here.
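
      For what it's worth, my reading of python/pyspark/sql/functions.py (which may be off) is that regexp_replace() explicitly converts its first argument to a Column, while lower() is one of the auto-generated wrappers that forwards whatever it is given straight to the JVM, where only a lower(Column) overload exists. A defensive pattern that sidesteps the inconsistency is to always pass a Column rather than a bare name string:

      from pyspark.sql.functions import col, lower, regexp_replace
      
      # Passing Column objects works the same way for both functions; the aliases
      # also keep the regex out of the auto-generated column names.
      consistent_df = raw_df.select(
          lower(col('string')).alias('lowered'),
          regexp_replace(col('string'), r'[\s]+', ' ').alias('normalized'),
      )
      consistent_df.show()
      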

    People

      • Assignee: Unassigned
      • Reporter: Nicholas Chammas (nchammas)
      • Votes: 0
      • Watchers: 3
