Details
Description
Enabling and using `spark.sql.ansi.doubleQuotedIdentifiers` does not work correctly in Python threads
The following example shows how applying a filter, "\"status\" = 'Unchanged'", leads to empty results when run in a thread. I believe this is because the "status" field is interpreted as a literal in the thread, but as an attribute outside of it.
from concurrent import futures from pyspark import sql spark = ( sql.SparkSession.builder.master("local[*]") .config("spark.sql.ansi.enabled", "true") .config("spark.sql.ansi.doubleQuotedIdentifiers", "true") .getOrCreate() ) def demonstrate_issue(spark): # Path to JSON file with contents: # [{"status": "Unchanged"}, {"status": "Changed"}] df = spark.read.json("data/example.json") df.filter("\"status\" = 'Unchanged'").show() # Shows 1 record, expected demonstrate_issue(spark) with futures.ThreadPoolExecutor(1) as executor: # Shows 0 records, unexpected executor.submit(demonstrate_issue, spark)
Additional testing notes:
- When parsing the expression with `sql.functions.expr` in Java via Py4J, the "status" field is interpreted as a literal value from the thread, not an attribute
- Using double quotes with `spark.sql` does work in the thread
- Using a dataframe created in memory does work in the thread
- Tested in versions 3.4.0, 3.4.1, & 3.5.0 on Windows and Mac
The original PR that added this option is here: https://github.com/apache/spark/pull/38022