Spark / SPARK-41666

Support parameterized SQL in PySpark


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Enhance the PySpark SQL API with support for parameterized SQL statements to improve security and reusability. Application developers will be able to write SQL with parameter markers whose values will be passed separately from the SQL code and interpreted as literals. This will help prevent SQL injection attacks for applications that generate SQL based on a user’s selections, which is often done via a user interface.
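      As a plain-Python illustration of the injection risk described above (the table and parameter names are hypothetical, not from the ticket), compare naive string interpolation with passing the value separately:

      ```python
      # Attacker-controlled value coming from a UI selection.
      user_input = "0 OR 1=1"

      # Naive interpolation: the input becomes part of the SQL text itself,
      # changing the meaning of the query.
      unsafe_query = f"SELECT id FROM t WHERE id = {user_input}"

      # With a parameter marker, the value travels separately from the SQL
      # code and is interpreted as a literal, never parsed as SQL.
      safe_query = "SELECT id FROM t WHERE id = @param1"
      safe_params = {"param1": user_input}
      ```

      The unsafe variant yields `SELECT id FROM t WHERE id = 0 OR 1=1`, which matches every row; the parameterized variant keeps the query text fixed.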

      PySpark already supports formatting of sqlText using the {...} syntax. The API should be left the same:

      def sql(self, sqlQuery: str, **kwargs: Any) -> DataFrame:

      and the new parameters should be supported via the same API.

      PySpark's sql() should pass unused parameters to the JVM side, where the Java sql() method handles them. For example:

      >>> mydf = spark.range(10)
      >>> spark.sql("SELECT id FROM {mydf} WHERE id % @param1 = 0", mydf=mydf, param1='3').show()
      +---+
      | id|
      +---+
      |  0|
      |  3|
      |  6|
      |  9|
      +---+
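      The dispatch described above (format the {...} markers in Python, forward the leftover keyword arguments as SQL parameters) can be sketched in plain Python. This is a simplified illustration of the idea, not Spark's actual implementation, and `split_sql_kwargs` is a hypothetical helper:

      ```python
      import string

      def split_sql_kwargs(sql_query: str, **kwargs):
          """Split kwargs into format arguments (consumed by {name} markers
          in the query text) and leftover parameters to be handed to the
          JVM-side sql() method."""
          # Names referenced by {...} markers in the query text.
          fields = {name for _, name, _, _ in
                    string.Formatter().parse(sql_query) if name}
          fmt_args = {k: v for k, v in kwargs.items() if k in fields}
          params = {k: v for k, v in kwargs.items() if k not in fields}
          return fmt_args, params

      fmt, params = split_sql_kwargs(
          "SELECT id FROM {mydf} WHERE id % @param1 = 0",
          mydf="<df>", param1="3")
      # fmt holds {"mydf": "<df>"} for Python-side formatting;
      # params holds {"param1": "3"} for the JVM-side sql() call.
      ```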
      


          People

            Assignee: Max Gekk
            Reporter: Max Gekk
            Votes: 0
            Watchers: 2
