Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 2.3.0
- Fix Version/s: None
Description
The sparkSession.sql API supports only a single SQL statement per call; a multi-statement SQL string cannot be executed in one call. For example:
SparkSession sparkSession = SparkSession.builder()
    .appName("MultiStatementSQL")
    .master("local")
    .config("", "")
    .getOrCreate()

sparkSession.sql("DROP TABLE IF EXISTS count_employees; CACHE TABLE employees; CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM employees; SELECT * FROM count_employees")
The code above fails with the error:
org.apache.spark.sql.catalyst.parser.ParseException: mismatched input ';' expecting <EOF>
The workaround is to call the .sql API once per statement, in order:
sparkSession.sql("DROP TABLE IF EXISTS count_employees")
sparkSession.sql("CACHE TABLE employees")
sparkSession.sql("CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM employees;")
sparkSession.sql("SELECT * FROM count_employees")
If these SQL statements come from a string or a file, users have to implement their own splitting to execute them, for example:
val sqlFromFile =
  """DROP TABLE IF EXISTS count_employees;
    |CACHE TABLE employees;
    |CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM employees;
    |SELECT * FROM count_employees""".stripMargin

sqlFromFile.split(";")
  .foreach(line => sparkSession.sql(line))
This naive parser fails on many edge cases (for example, a ";" inside a string literal). Even if users reimplement parsing with the same grammar Spark uses, their implementation can drift out of sync with the way Spark parses the statements.
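A minimal sketch of the edge case described above. This is plain Java with hypothetical helper names, not Spark or Spark's parser: it contrasts a naive split(";") with a quote-aware splitter that tracks single-quoted string literals. Even the quote-aware version is only a sketch and still ignores comments, escaped quotes, and other dialect details, which is exactly why hand-rolled splitting diverges from Spark's own parsing.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSqlSplitter {

    // Naive split: also breaks inside string literals that contain ';'.
    static String[] naiveSplit(String sql) {
        return sql.split(";");
    }

    // Quote-aware split: a ';' inside a single-quoted literal does not
    // end the statement. Sketch only: no comments, no escaped quotes.
    static List<String> quoteAwareSplit(String sql) {
        List<String> stmts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inString = false;
        for (char c : sql.toCharArray()) {
            if (c == '\'') {
                inString = !inString;           // enter/leave a string literal
                current.append(c);
            } else if (c == ';' && !inString) {
                if (!current.toString().trim().isEmpty()) {
                    stmts.add(current.toString().trim());
                }
                current.setLength(0);           // start the next statement
            } else {
                current.append(c);
            }
        }
        if (!current.toString().trim().isEmpty()) {
            stmts.add(current.toString().trim());
        }
        return stmts;
    }

    public static void main(String[] args) {
        String sql = "SELECT ';' AS sep; SELECT 1";
        // Two statements, but the naive split produces three fragments,
        // cutting the literal ';' in half.
        System.out.println(naiveSplit(sql).length);      // 3
        System.out.println(quoteAwareSplit(sql).size()); // 2
    }
}
```

Running main shows the naive split yielding three broken fragments for a two-statement input, while the quote-aware version yields the intended two.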
Can support for multiple SQL statements be built into the SparkSession.sql API itself?
Attachments
Issue Links
- is related to: SPARK-30822 Pyspark queries fail if terminated with a semicolon (Resolved)