[SPARK-46190] ANSI Double quoted identifiers do not work in Python threads - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.4.0, 3.4.1, 3.5.0
Fix Version/s: None
Component/s: PySpark
Labels:
- ansi-sql
- python
- threads

Description

Enabling and using `spark.sql.ansi.doubleQuotedIdentifiers` does not work correctly in Python threads.

The following example shows how applying the filter "\"status\" = 'Unchanged'" returns the expected record in the main thread but empty results when run in a thread. I believe this is because "status" is interpreted as a string literal in the thread, but as an attribute (column reference) outside of it.

from concurrent import futures
from pyspark import sql

spark = (
  sql.SparkSession.builder.master("local[*]")
  .config("spark.sql.ansi.enabled", "true")
  .config("spark.sql.ansi.doubleQuotedIdentifiers", "true")
  .getOrCreate()
)

def demonstrate_issue(spark):
  # Path to JSON file with contents:
  # [{"status": "Unchanged"}, {"status": "Changed"}]
  df = spark.read.json("data/example.json")
  df.filter("\"status\" = 'Unchanged'").show()

# Shows 1 record, expected
demonstrate_issue(spark)

with futures.ThreadPoolExecutor(1) as executor:
  # Shows 0 records, unexpected
  executor.submit(demonstrate_issue, spark)
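
The comparison can also be made checkable by returning row counts instead of reading `show()` output. The following is a minimal sketch, not part of the original report: it assumes the sample records are written as JSON Lines to a temporary directory, and the names `write_example`, `count_unchanged`, `run_both`, and `main` are hypothetical. Per the report, the first count should be 1 and the second 0 on the affected versions.

```python
import json
import os
import tempfile
from concurrent import futures


def write_example(directory):
    """Write the two sample records as JSON Lines and return the file path."""
    path = os.path.join(directory, "example.json")
    with open(path, "w", encoding="utf-8") as handle:
        for record in ({"status": "Unchanged"}, {"status": "Changed"}):
            handle.write(json.dumps(record) + "\n")
    return path


def count_unchanged(spark, path):
    """Count rows matching the double-quoted filter from the report."""
    df = spark.read.json(path)
    return df.filter("\"status\" = 'Unchanged'").count()


def run_both(spark, path):
    """Return (main-thread count, worker-thread count)."""
    main_count = count_unchanged(spark, path)
    with futures.ThreadPoolExecutor(1) as executor:
        # .result() re-raises any exception from the worker thread.
        thread_count = executor.submit(count_unchanged, spark, path).result()
    return main_count, thread_count


def main():
    # Imported here so the helpers above load without PySpark installed.
    from pyspark import sql

    spark = (
        sql.SparkSession.builder.master("local[*]")
        .config("spark.sql.ansi.enabled", "true")
        .config("spark.sql.ansi.doubleQuotedIdentifiers", "true")
        .getOrCreate()
    )
    with tempfile.TemporaryDirectory() as tmp:
        print(run_both(spark, write_example(tmp)))
    spark.stop()
```

Calling `main()` with PySpark installed prints the pair of counts. Using `.result()` also surfaces any exception raised in the worker thread, which a bare `executor.submit(...)` would silently discard.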

Additional testing notes:

- When the expression is parsed with `sql.functions.expr` (in Java, via Py4J) from within the thread, "status" is interpreted as a literal value, not an attribute
- Using double quotes with `spark.sql` does work in the thread
- Using a dataframe created in memory does work in the thread
- Tested in versions 3.4.0, 3.4.1, & 3.5.0 on Windows and Mac
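
As a possible interim workaround, the filter can be written so that it does not depend on the double-quoted-identifier setting at all. The sketch below shows two such forms: a backtick-quoted identifier, which Spark SQL always treats as an identifier, and the Column API, which does not go through the SQL string parser. The helper names `quote_identifier`, `filter_with_backticks`, and `filter_with_column_api` are hypothetical, and whether these forms avoid the thread issue on the affected versions is an assumption that was not tested in this report.

```python
def quote_identifier(name):
    """Backtick-quote a column name, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def filter_with_backticks(df):
    """Filter using the string `status` = 'Unchanged' (no double quotes)."""
    return df.filter(quote_identifier("status") + " = 'Unchanged'")


def filter_with_column_api(df):
    """Filter using a Column expression instead of a SQL string."""
    # Imported here so the helpers above load without PySpark installed.
    from pyspark.sql import functions as F

    return df.filter(F.col("status") == "Unchanged")
```

Either helper could be called in place of the `df.filter(...)` line inside `demonstrate_issue`.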

The original PR that added this option is here: https://github.com/apache/spark/pull/38022

People

Assignee: Unassigned

Reporter: Max Payson

Votes: 0

Watchers: 1

Dates

Created: 30/Nov/23 22:30

Updated: 30/Nov/23 22:30