Spark / SPARK-29018

Build Spark thrift server on its own code based on protocol v11


Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Background

          With the development of Spark and Hive, the current sql/hive-thriftserver module requires a lot of work to resolve code conflicts between the different built-in Hive versions. Under the current approach this is annoying, never-ending work, and it has limited our ability to conveniently develop new features for Spark's thrift server.

          We propose to implement a new thrift server and JDBC driver based on version 11 (v11) of Hive's TCLIService.thrift protocol. The new thrift server will have the following features:

      1. Build a new module, spark-service, as Spark's thrift server
      2. Avoid the heavy use of reflection and inherited code that the current `hive-thriftserver` module needs
      3. Support all functions that the current `sql/hive-thriftserver` supports
      4. Keep all code maintained by Spark itself, with no dependency on Hive
      5. Implement the original functionality in Spark's own way, no longer limited by Hive's code
      6. Support running with or without a Hive metastore
      7. Support user impersonation for multi-tenancy by splitting Hive authentication from DFS authentication
      8. Support session hooks implemented with Spark's own code
      9. Add a new JDBC driver, spark-jdbc, with Spark's own connection URL `jdbc:spark://<host>:<port>/<db>`
      10. Support both hive-jdbc and spark-jdbc clients, so that most existing clients and BI platforms keep working

      How to start?

           We can start the new thrift server with sbin/start-spark-thriftserver.sh and stop it with sbin/stop-spark-thriftserver.sh. HiveConf configurations are no longer needed to determine the behavior of the Spark thrift server: all needed configuration is implemented by Spark itself in `org.apache.spark.sql.service.internal.ServiceConf`, and hive-site.xml is only used to connect to the Hive metastore. Any needed conf can be written in conf/spark-defaults.conf or passed with --conf on the startup command, as sketched below.
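           As a minimal sketch (the script names come from this proposal; the --conf keys shown are ordinary Spark settings used purely for illustration):

              # Start the proposed thrift server; any ServiceConf setting can be
              # placed in conf/spark-defaults.conf beforehand or passed with --conf.
              sbin/start-spark-thriftserver.sh \
                --master yarn \
                --conf spark.sql.shuffle.partitions=200

              # Stop it again.
              sbin/stop-spark-thriftserver.sh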

      How to connect through JDBC?

         We now support both hive-jdbc and spark-jdbc; users can choose whichever they prefer.

      spark-jdbc

      1. Use `SparkDriver` as the JDBC driver class
      2. Connection URL `jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`, the same as Hive's except for Spark's own URL prefix `jdbc:spark`
      3. For proxying, a SparkDriver user should set the proxy conf `spark.sql.thriftserver.proxy.user=username` (see the sketch below)
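      As an illustration, a minimal sketch of a spark-jdbc connection. The fully qualified driver class name is not spelled out in this proposal, so the package below is a hypothetical placeholder, and placing the proxy conf in the sess_var_list position is an assumption:

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.ResultSet;
          import java.sql.Statement;

          public class SparkJdbcExample {
              public static void main(String[] args) throws Exception {
                  // Hypothetical FQCN; the proposal only names the class `SparkDriver`.
                  Class.forName("org.apache.spark.sql.service.jdbc.SparkDriver");

                  // Spark's own URL prefix; the proxy conf key comes from this proposal.
                  String url = "jdbc:spark://host1:10000,host2:10000/default"
                          + ";spark.sql.thriftserver.proxy.user=etl_user";

                  try (Connection conn = DriverManager.getConnection(url, "admin", "");
                       Statement stmt = conn.createStatement();
                       ResultSet rs = stmt.executeQuery("SELECT 1")) {
                      while (rs.next()) {
                          System.out.println(rs.getInt(1));
                      }
                  }
              }
          }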

      hive-jdbc

      1. Use `HiveDriver` as the JDBC driver class
      2. Connection URL `jdbc:hive2://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`, as before
      3. For proxying, a HiveDriver user should set the proxy conf `hive.server2.proxy.user=username`; the current server supports both configs (see the sketch below)
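      And the equivalent sketch with the stock hive-jdbc driver, using standard hive-jdbc URL semantics: the proxy conf sits in the sess_var_list section, ?conf_list carries hiveconf-style settings, and #var_list carries hivevar-style variables:

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.ResultSet;
          import java.sql.Statement;

          public class HiveJdbcExample {
              public static void main(String[] args) throws Exception {
                  Class.forName("org.apache.hive.jdbc.HiveDriver"); // stock hive-jdbc driver

                  String url = "jdbc:hive2://host1:10000,host2:10000/default"
                          + ";hive.server2.proxy.user=etl_user" // sess_var_list
                          + "?hive.exec.parallel=true"          // conf_list (hiveconf style)
                          + "#run_date=2019-09-01";             // var_list (hivevar style)

                  try (Connection conn = DriverManager.getConnection(url, "admin", "");
                       Statement stmt = conn.createStatement();
                       ResultSet rs = stmt.executeQuery("SELECT 1")) {
                      while (rs.next()) {
                          System.out.println(rs.getInt(1));
                      }
                  }
              }
          }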

      How is it done today, and what are the limits of current practice?

      Current practice

      We have completed the two modules `spark-service` and `spark-jdbc`. They run well; we have ported the original unit tests to these two modules, and they pass. For impersonation, we have written the code and tested it in our kerberized environment, where it works well and now awaits review. We will raise PRs against the apache/spark master branch step by step.

      Here are some known changes:

      1. No Hive code is used in the `spark-service` and `spark-jdbc` modules
      2. In the current service, the default rc file suffix `.hiverc` is replaced by `.sparkrc`
      3. When SparkDriver is used as the JDBC driver class, the URL should use `jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`
      4. When SparkDriver is used as the JDBC driver class, the proxy conf should be `spark.sql.thriftserver.proxy.user=proxy_user_name`
      5. `hiveconf` and `hivevar` session confs are still supported through a hive-jdbc connection (as in the hive-jdbc sketch above)

      What are the risks?

          This is a totally new module and won't change other modules' code except to support impersonation. Apart from impersonation, we have added many unit tests adapted from the original ones (with the grammar adjusted to work without Hive), and all of them pass. Impersonation has been tested in our kerberized environment, but it still needs detailed review since it changes a lot.

      How long will it take?

             We have completed all of this work in our own repo; we now plan to merge the code into master step by step.

      1. Phase 1: PR to build the new module spark-service under sql/service
      2. Phase 2: PR with the thrift protocol file and the generated thrift protocol Java code
      3. Phase 3: PR with all spark-service module code, a description of the design, and unit tests
      4. Phase 4: PR to build the new module spark-jdbc under sql/jdbc
      5. Phase 5: PR with all spark-jdbc module code and unit tests
      6. Phase 6: PR to support thrift server impersonation
      7. Phase 7: PR to build Spark's own beeline client, spark-beeline
      8. Phase 8: PR with Spark's own CLI client code to support the Spark SQL CLI, in a module named spark-cli

      Appendix A. Proposed API Changes

      Compared to the current `sql/hive-thriftserver`, the corresponding API changes are as follows:

      1. Add a new class org.apache.spark.sql.service.internal.ServiceConf that contains all needed configuration for the Spark thrift server
      2. ServiceSessionXxx classes correspond to the original HiveSessionXxx ones
      3. In ServiceSessionImpl, remove code Spark won't use
      4. In ServiceSessionImpl, set session confs directly on sqlConf, as in https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L67-L69
      5. Remove SparkSQLSessionManager and fold its logic into SessionManager
      6. Move all OperationManager logic into SparkSQLOperationManager and rename it to OperationManager
      7. Add SQLContext to ServiceSessionImpl as its own field; instead of passing it through SparkSQLOperationManager, operations get it via parentSession.getSqlContext(), and session confs are set on this sqlContext's sqlConf (see the sketch after this list)
      8. Remove HiveServer2, since we don't need its logic
      9. Remove the Hive impersonation logic, since it won't be useful in the Spark thrift server, and remove the delegationTokenStr parameter in ServiceSessionImplWithUGI (https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java#L352-L353); we will use a new way for Spark's impersonation
      10. Remove ThriftserverShimUtils, since we don't need it
      11. Remove SparkSQLCLIService and just use CLIService
      12. Remove ReflectionUtils and ReflectedCompositeService, since we no longer need inheritance and reflection
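      A hypothetical sketch of the session wiring described in items 4 and 7; none of the signatures below are final, they only illustrate the intent that each session owns its SQLContext and session confs land directly on it:

          import java.util.Map;
          import org.apache.spark.sql.SQLContext;

          public class ServiceSessionImpl {
              private final SQLContext sqlContext;

              // Illustrative constructor: the session receives its own SQLContext
              // and applies the client's session confs straight to it (item 4).
              public ServiceSessionImpl(SQLContext sqlContext, Map<String, String> sessionConf) {
                  this.sqlContext = sqlContext;
                  sessionConf.forEach(sqlContext::setConf);
              }

              // Operations fetch the context from their parent session instead of
              // having it passed around by SparkSQLOperationManager (item 7).
              public SQLContext getSqlContext() {
                  return sqlContext;
              }
          }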

            People

               Assignee: Unassigned
               Reporter: angerszhu
