Spark / SPARK-29018

Build Spark thrift server on its own code based on protocol v11


Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Background

          With the development of Spark and Hive, the current sql/hive-thriftserver module requires a lot of work to resolve code conflicts between the different built-in Hive versions. Under the current approach this is annoying, never-ending work, and it has limited our ability to conveniently develop new features for Spark's thrift server.

          We propose to implement a new thrift server and JDBC driver based on version 11 (v11) of Hive's TCLIService.thrift protocol. The new thrift server will have the following features:

      1. Build a new module, spark-service, as Spark's thrift server
      2. Avoid the heavy use of reflection and inherited code that the current `hive-thriftserver` module needs
      3. Support all functions that the current `sql/hive-thriftserver` supports
      4. Keep all code maintained by Spark itself, with no dependency on Hive
      5. Implement the original functionality in Spark's own way, no longer limited by Hive's code
      6. Support running with or without a Hive metastore
      7. Support user impersonation for multi-tenancy by splitting Hive authentication from DFS authentication
      8. Support session hooks implemented with Spark's own code
      9. Add a new JDBC driver, spark-jdbc, with Spark's own connection URL `jdbc:spark://<host>:<port>/<db>`
      10. Support both hive-jdbc and spark-jdbc clients, so that most existing clients and BI platforms keep working

      How to start?

           We can start the new thrift server with sbin/start-spark-thriftserver.sh and stop it with sbin/stop-spark-thriftserver.sh. HiveConf configurations are no longer needed to determine the behavior of the Spark thrift server: all needed configuration is implemented by Spark itself in `org.apache.spark.sql.service.internal.ServiceConf`, and hive-site.xml is only used to connect to the Hive metastore. Any needed conf can be written in conf/spark-defaults.conf or passed with --conf on the startup command, as sketched below.
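           As a minimal sketch (the script names come from this proposal; the --conf keys shown are ordinary Spark settings used purely for illustration):

              # Start the proposed thrift server; any ServiceConf setting can be
              # placed in conf/spark-defaults.conf beforehand or passed with --conf.
              sbin/start-spark-thriftserver.sh \
                --master yarn \
                --conf spark.sql.shuffle.partitions=200

              # Stop it again.
              sbin/stop-spark-thriftserver.sh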

      How to connect through JDBC?

         We now support both hive-jdbc and spark-jdbc; users can choose whichever they prefer.

      spark-jdbc

      1. Use `SparkDriver` as the JDBC driver class
      2. Connection URL `jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`, the same as Hive's except for Spark's own URL prefix `jdbc:spark`
      3. For proxying, a SparkDriver user should set the proxy conf `spark.sql.thriftserver.proxy.user=username` (see the sketch below)
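      As an illustration, a minimal sketch of a spark-jdbc connection. The fully qualified driver class name is not spelled out in this proposal, so the package below is a hypothetical placeholder, and placing the proxy conf in the sess_var_list position is an assumption:

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.ResultSet;
          import java.sql.Statement;

          public class SparkJdbcExample {
              public static void main(String[] args) throws Exception {
                  // Hypothetical FQCN; the proposal only names the class `SparkDriver`.
                  Class.forName("org.apache.spark.sql.service.jdbc.SparkDriver");

                  // Spark's own URL prefix; the proxy conf key comes from this proposal.
                  String url = "jdbc:spark://host1:10000,host2:10000/default"
                          + ";spark.sql.thriftserver.proxy.user=etl_user";

                  try (Connection conn = DriverManager.getConnection(url, "admin", "");
                       Statement stmt = conn.createStatement();
                       ResultSet rs = stmt.executeQuery("SELECT 1")) {
                      while (rs.next()) {
                          System.out.println(rs.getInt(1));
                      }
                  }
              }
          }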

      hive-jdbc

      1. Use `HiveDriver` as the JDBC driver class
      2. Connection URL `jdbc:hive2://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`, as before
      3. For proxying, a HiveDriver user should set the proxy conf `hive.server2.proxy.user=username`; the current server supports both configs (see the sketch below)
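      And the equivalent sketch with the stock hive-jdbc driver, using standard hive-jdbc URL semantics: the proxy conf sits in the sess_var_list section, ?conf_list carries hiveconf-style settings, and #var_list carries hivevar-style variables:

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.ResultSet;
          import java.sql.Statement;

          public class HiveJdbcExample {
              public static void main(String[] args) throws Exception {
                  Class.forName("org.apache.hive.jdbc.HiveDriver"); // stock hive-jdbc driver

                  String url = "jdbc:hive2://host1:10000,host2:10000/default"
                          + ";hive.server2.proxy.user=etl_user" // sess_var_list
                          + "?hive.exec.parallel=true"          // conf_list (hiveconf style)
                          + "#run_date=2019-09-01";             // var_list (hivevar style)

                  try (Connection conn = DriverManager.getConnection(url, "admin", "");
                       Statement stmt = conn.createStatement();
                       ResultSet rs = stmt.executeQuery("SELECT 1")) {
                      while (rs.next()) {
                          System.out.println(rs.getInt(1));
                      }
                  }
              }
          }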

      How is it done today, and what are the limits of current practice?

      Current practice

      We have completed the two modules `spark-service` and `spark-jdbc`. They run well; we have ported the original unit tests to these two modules, and they pass. For impersonation, we have written the code and tested it in our kerberized environment, where it works well and now awaits review. We will raise PRs against the apache/spark master branch step by step.

      Here are some known changes:

      1. No Hive code is used in the `spark-service` and `spark-jdbc` modules
      2. In the current service, the default rc file suffix `.hiverc` is replaced by `.sparkrc`
      3. When SparkDriver is used as the JDBC driver class, the URL should use `jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`
      4. When SparkDriver is used as the JDBC driver class, the proxy conf should be `spark.sql.thriftserver.proxy.user=proxy_user_name`
      5. `hiveconf` and `hivevar` session confs are still supported through a hive-jdbc connection (as in the hive-jdbc sketch above)

      What are the risks?

          This is a totally new module and won't change other modules' code except to support impersonation. Apart from impersonation, we have added many unit tests adapted from the original ones (with the grammar adjusted to work without Hive), and all of them pass. Impersonation has been tested in our kerberized environment, but it still needs detailed review since it changes a lot.

      How long will it take?

             We have completed all of this work in our own repo; we now plan to merge the code into master step by step.

      1. Phase 1: PR to build the new module spark-service under sql/service
      2. Phase 2: PR with the thrift protocol file and the generated thrift protocol Java code
      3. Phase 3: PR with all spark-service module code, a description of the design, and unit tests
      4. Phase 4: PR to build the new module spark-jdbc under sql/jdbc
      5. Phase 5: PR with all spark-jdbc module code and unit tests
      6. Phase 6: PR to support thrift server impersonation
      7. Phase 7: PR to build Spark's own beeline client, spark-beeline
      8. Phase 8: PR with Spark's own CLI client code to support the Spark SQL CLI, in a module named spark-cli

      Appendix A. Proposed API Changes

      Compared to the current `sql/hive-thriftserver`, the corresponding API changes are as follows:

      1. Add a new class org.apache.spark.sql.service.internal.ServiceConf that contains all needed configuration for the Spark thrift server
      2. ServiceSessionXxx classes correspond to the original HiveSessionXxx ones
      3. In ServiceSessionImpl, remove code Spark won't use
      4. In ServiceSessionImpl, set session confs directly on sqlConf, as in https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L67-L69
      5. Remove SparkSQLSessionManager and fold its logic into SessionManager
      6. Move all OperationManager logic into SparkSQLOperationManager and rename it to OperationManager
      7. Add SQLContext to ServiceSessionImpl as its own field; instead of passing it through SparkSQLOperationManager, operations get it via parentSession.getSqlContext(), and session confs are set on this sqlContext's sqlConf (see the sketch after this list)
      8. Remove HiveServer2, since we don't need its logic
      9. Remove the Hive impersonation logic, since it won't be useful in the Spark thrift server, and remove the delegationTokenStr parameter in ServiceSessionImplWithUGI (https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java#L352-L353); we will use a new way for Spark's impersonation
      10. Remove ThriftserverShimUtils, since we don't need it
      11. Remove SparkSQLCLIService and just use CLIService
      12. Remove ReflectionUtils and ReflectedCompositeService, since we no longer need inheritance and reflection
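      A hypothetical sketch of the session wiring described in items 4 and 7; none of the signatures below are final, they only illustrate the intent that each session owns its SQLContext and session confs land directly on it:

          import java.util.Map;
          import org.apache.spark.sql.SQLContext;

          public class ServiceSessionImpl {
              private final SQLContext sqlContext;

              // Illustrative constructor: the session receives its own SQLContext
              // and applies the client's session confs straight to it (item 4).
              public ServiceSessionImpl(SQLContext sqlContext, Map<String, String> sessionConf) {
                  this.sqlContext = sqlContext;
                  sessionConf.forEach(sqlContext::setConf);
              }

              // Operations fetch the context from their parent session instead of
              // having it passed around by SparkSQLOperationManager (item 7).
              public SQLContext getSqlContext() {
                  return sqlContext;
              }
          }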

            People

               Assignee: Unassigned
               Reporter: angerszhu
