Description
Enhance the PySpark SQL API with support for parameterized SQL statements to improve security and reusability. Application developers will be able to write SQL with parameter markers whose values will be passed separately from the SQL code and interpreted as literals. This will help prevent SQL injection attacks for applications that generate SQL based on a user’s selections, which is often done via a user interface.
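For illustration, the intended contrast looks like this (a sketch only: the table name tbl, the input value, and the @-style parameter marker follow this ticket's proposal and are hypothetical):

>>> user_choice = "0 OR 1 = 1"  # hypothetical malicious UI selection
>>> # Unsafe: the input is spliced into the SQL text and parsed as SQL
>>> spark.sql("SELECT id FROM tbl WHERE id = " + user_choice)
>>> # Proposed: the value travels separately and is always bound as a literal
>>> spark.sql("SELECT id FROM tbl WHERE id = @choice", choice=user_choice)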
PySpark already supports formatting of sqlText using the {...} syntax. The API should stay the same:
def sql(self, sqlQuery: str, **kwargs: Any) -> DataFrame:
and the new parameters should be supported through the same API.
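For reference, the existing {...} formatting substitutes keyword arguments into the SQL text before parsing (example adapted from the PySpark docs):

>>> spark.sql(
...     "SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}",
...     bound1=7, bound2=9
... ).show()
+---+
| id|
+---+
|  8|
+---+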
PySpark's sql() should pass any unused keyword arguments to the JVM side, where the Java sql() method handles them. For example:
>>> mydf = spark.range(10)
>>> spark.sql("SELECT id FROM {mydf} WHERE id % @param1 = 0", mydf=mydf, param1='3').show()
+---+
| id|
+---+
|  0|
|  3|
|  6|
|  9|
+---+
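A minimal sketch of how sql() could split the keyword arguments (the helper name and the regex are illustrative assumptions, not the actual implementation):

import re
from typing import Any, Dict, Tuple

def _split_kwargs(sqlQuery: str, kwargs: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    # Names referenced as {name} in the query go to the string formatter;
    # everything else would be forwarded to the JVM sql() as SQL parameters.
    fmt_names = set(re.findall(r"\{(\w+)[^{}]*\}", sqlQuery))
    fmt_args = {k: v for k, v in kwargs.items() if k in fmt_names}
    params = {k: v for k, v in kwargs.items() if k not in fmt_names}
    return fmt_args, params

>>> _split_kwargs("SELECT id FROM {mydf} WHERE id % @param1 = 0",
...               {"mydf": mydf, "param1": "3"})
({'mydf': DataFrame[id: bigint]}, {'param1': '3'})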
Issue Links
- is a clone of SPARK-41271 Parameterized SQL (Resolved)