Hive
  1. Hive
  2. HIVE-584

Clean up global and ThreadLocal variables in Hive

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.5.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      HIVE-584. Clean up global and ThreadLocal variables in Hive. (Yonqqiang He via zshao)

      Description

      Currently in Hive code there are several global and ThreadLocal variables that need to be cleaned.

      Specifically, the following classes are involved:
      1. HiveConf: contains hive configurations (and a classloader)
      2. Hive class: contains a static member Hive db. Hive class contains a member HiveConf conf, as well as a ThreadLocal storage of IMetaStoreClient.
      3. SessionState: contains a static ThreadLocal storage of SessionState. SessionState class contains a Hive db, a HiveConf conf, a history logger, and a bunch of standard input/output streams
      4. CliSessionState: SessionState plus some command options and the command file name.
      5. All classes that try to get Hive db or HiveConf from global static Hive db, or SessionState.

      There are several problems with the current design. To name a few:
      1. SessionState instances are ThreadLocal, but SessionState contains Hive db which also contains ThreadLocal storage. Not sure a db can be shared across different threads or not? What is the global static Hive db?
      2. We pass HiveConf and Hive db in two ways to classes like Task: Sometimes through initialize(), sometimes through SessionState. This complicates the code a lot. It's hard to know which HiveConf and which db we should use.

      We need to think about a better way to do it.

      1. hive-584-2009-9-10.patch
        6 kB
        He Yongqiang
      2. hive-584-2009-9-11.patch
        6 kB
        He Yongqiang

        Issue Links

          Activity

          Hide
          Prasad Chakka added a comment -

          db for Hive.java objects is a misnomer. If we are cleaning up code then we should rename it. All it does is do metadata access.

          Hive.java can be shared across multiple threads as long as they share same HiveConf. But that is useless since we do want multiple threads to have separate HiveConfs.

          We should make Hive.db, Hive.conf, Hive.metastoreClient all non-static, non-threadlocal variables and make Hive.java itself a thread local variable. We nedd Hive.java object independent of SessionState or CliSessionState in lot o code where these objects don't exist. Passing conf around on all of the calls makes code cumbersome.

          public class Hive {
          
            private HiveConf conf;
            private IMetaStoreClient msc;
          
            /**
             * creates and returns Hive object representing metadata for this thread. if a Hive object already exists for this thread, the passed HiveConf
             * is compared with the HiveConf stored in Hive object. If any parameters have changed then a new Hive object is created and returned.
             */
            public static Hive get(HiveConf c) throws HiveException {}
           
            /**
             * similar to get(HiveConf) but a new HiveConf object
             */
            public static Hive get() throws HiveException {}
          }
          
          public class SessionState {
             public static getHive(HiveConf conf) throws HiveException {
                 return Hive.getHive(conf);
             }
             public static getDb() throws HiveException {
                 return Hive.getHive();
             }
             public HiveConf getConf() {
                 return getDb().getConf();
             }
          
             private SessionState(HiveConf conf) {
             }
          
             /** returns thread local SessionState and creates it on first call */
             public static get(HiveConf conf);
             public static get();
          }
          

          SessionState also refers to this thread local Hive object instead of storing it as a class variable.

          Would this work?

          Show
          Prasad Chakka added a comment - db for Hive.java objects is a misnomer. If we are cleaning up code then we should rename it. All it does is do metadata access. Hive.java can be shared across multiple threads as long as they share same HiveConf. But that is useless since we do want multiple threads to have separate HiveConfs. We should make Hive.db, Hive.conf, Hive.metastoreClient all non-static, non-threadlocal variables and make Hive.java itself a thread local variable. We nedd Hive.java object independent of SessionState or CliSessionState in lot o code where these objects don't exist. Passing conf around on all of the calls makes code cumbersome. public class Hive { private HiveConf conf; private IMetaStoreClient msc; /** * creates and returns Hive object representing metadata for this thread. if a Hive object already exists for this thread, the passed HiveConf * is compared with the HiveConf stored in Hive object. If any parameters have changed then a new Hive object is created and returned. */ public static Hive get(HiveConf c) throws HiveException {} /** * similar to get(HiveConf) but a new HiveConf object */ public static Hive get() throws HiveException {} } public class SessionState { public static getHive(HiveConf conf) throws HiveException { return Hive.getHive(conf); } public static getDb() throws HiveException { return Hive.getHive(); } public HiveConf getConf() { return getDb().getConf(); } private SessionState(HiveConf conf) { } /** returns thread local SessionState and creates it on first call */ public static get(HiveConf conf); public static get(); } SessionState also refers to this thread local Hive object instead of storing it as a class variable. Would this work?
          Hide
          Edward Capriolo added a comment -

          I ran into a few of these issues with HWI, so I feel your pain. Passing the conf around is very cumbersome. I remember my case I wanted to get the Conf from a CLI session state. It would be nice if each object stored the conf as a private variable with a getter. It would help out the web interface code a lot. The implementation above seems to make sense.

          Show
          Edward Capriolo added a comment - I ran into a few of these issues with HWI, so I feel your pain. Passing the conf around is very cumbersome. I remember my case I wanted to get the Conf from a CLI session state. It would be nice if each object stored the conf as a private variable with a getter. It would help out the web interface code a lot. The implementation above seems to make sense.
          Hide
          Zheng Shao added a comment -

          Prasad and I had some offline discussions. First of all, some high-level conclusions:

          0. Lifetime of the objects: The lifetime of all hive operators are a part of the lifetime of their configuration. The lifetime of all Tasks (ExecDriver, MapRedTask, MoveTask, etc) are a part of the lifetime of the db connection. The lifetime of a CommandProcessor is also a part of the lifetime of the db connection (well in some CommandProcessor we want to switch and close the db connection, as the actual operation that we want to do). The CliDriver and HiveServerHandler should correspond to a session.
          1. ThreadLocal is good for keeping thread-local variables for ease-of-use (don't need to pass the object around, or give it in the constructor), ease-of-debugging (because there is no chance that another thread will change the content of the thread-local storage), and security reasons (the same as debugging).
          2. Passing objects in constructor (and add a getter) is good for easy understanding of the program flow, as well as allowing the same thread to have 2 different objects (e.g. db connection). There is not such a strong need since all db connection calls are blocking. We can always switch the thread-local db connection to the correct one before we start every call.

          Second, we have to make a choice between the 3:
          A. Make Hive consists of the db connection and the conf. Make it a thread local storage. Tasks (including ExecDriver), CommandProcessors(including Driver), and also CliDriver/HiveServerHandler will access this thread-local db connection and conf. Make SessionState consists of other seesion specific things like stdout, history, etc (but NOT Hive). SessionState is also thread-specific and CliDriver will access SessionState for these information, as well as access Hive for db connection etc. So both Hive and SessionState are independent and thread-local.

          B. Make Hive consists of the db connection and the conf. Pass Hive as a constructor/initialize parameter to all Tasks (including ExecDriver) and CommandProcessors(including Driver). Make SessionState consists of session specific things like stdout, history, and ALSO Hive. CliDriver/HiveServerHandler will use the thread-specific SessionState for all things. So only SessionState is thread-local (while Hive is part of it).

          C. The same as B, except letting SessionState be a parameter to the constructor of CliDriver/HiveServerHandler. So there is no thread-local storage at all. So nothing is thread-local.

          The benefit shared by all these 3 is that Tasks (including ExecDriver) and CommandProcessors(including Driver) don't need to deal with Session - they just need a Hive.

          NOTE: All hive operators (at mapper and reducer) should continue to use the configuration that is passed, just to conform to hadoop model.

          Show
          Zheng Shao added a comment - Prasad and I had some offline discussions. First of all, some high-level conclusions: 0. Lifetime of the objects: The lifetime of all hive operators are a part of the lifetime of their configuration. The lifetime of all Tasks (ExecDriver, MapRedTask, MoveTask, etc) are a part of the lifetime of the db connection. The lifetime of a CommandProcessor is also a part of the lifetime of the db connection (well in some CommandProcessor we want to switch and close the db connection, as the actual operation that we want to do). The CliDriver and HiveServerHandler should correspond to a session. 1. ThreadLocal is good for keeping thread-local variables for ease-of-use (don't need to pass the object around, or give it in the constructor), ease-of-debugging (because there is no chance that another thread will change the content of the thread-local storage), and security reasons (the same as debugging). 2. Passing objects in constructor (and add a getter) is good for easy understanding of the program flow, as well as allowing the same thread to have 2 different objects (e.g. db connection). There is not such a strong need since all db connection calls are blocking. We can always switch the thread-local db connection to the correct one before we start every call. Second, we have to make a choice between the 3: A. Make Hive consists of the db connection and the conf. Make it a thread local storage. Tasks (including ExecDriver), CommandProcessors(including Driver), and also CliDriver/HiveServerHandler will access this thread-local db connection and conf. Make SessionState consists of other seesion specific things like stdout, history, etc (but NOT Hive). SessionState is also thread-specific and CliDriver will access SessionState for these information, as well as access Hive for db connection etc. So both Hive and SessionState are independent and thread-local. B. Make Hive consists of the db connection and the conf. Pass Hive as a constructor/initialize parameter to all Tasks (including ExecDriver) and CommandProcessors(including Driver). Make SessionState consists of session specific things like stdout, history, and ALSO Hive. CliDriver/HiveServerHandler will use the thread-specific SessionState for all things. So only SessionState is thread-local (while Hive is part of it). C. The same as B, except letting SessionState be a parameter to the constructor of CliDriver/HiveServerHandler. So there is no thread-local storage at all. So nothing is thread-local. The benefit shared by all these 3 is that Tasks (including ExecDriver) and CommandProcessors(including Driver) don't need to deal with Session - they just need a Hive. NOTE: All hive operators (at mapper and reducer) should continue to use the configuration that is passed, just to conform to hadoop model.
          Hide
          Prasad Chakka added a comment -

          I would prefer option A because the other options are not really needed since only one object each of SessionState and Hive are needed at any time for a single thread. Using B and C, means the caller has to pass in these objects and find correct objects.

          But if we need, multiple Hive objects per thread at any point of time then, options B & C are better.

          Show
          Prasad Chakka added a comment - I would prefer option A because the other options are not really needed since only one object each of SessionState and Hive are needed at any time for a single thread. Using B and C, means the caller has to pass in these objects and find correct objects. But if we need, multiple Hive objects per thread at any point of time then, options B & C are better.
          Hide
          He Yongqiang added a comment -

          Attach a fix with appoach A.

          Show
          He Yongqiang added a comment - Attach a fix with appoach A.
          Hide
          Zheng Shao added a comment -

          @hive-584-2009-9-10.patch:

          nitpick: This function should directly call "hiveDb.remove()". We don't want duplicated code.
          public static void closeCurrent()

          Show
          Zheng Shao added a comment - @hive-584-2009-9-10.patch: nitpick: This function should directly call "hiveDb.remove()". We don't want duplicated code. public static void closeCurrent()
          Hide
          Zheng Shao added a comment -

          @hive-584-2009-9-10.patch:

          Please remove "SessionState.getDb()" - nobody is calling it, and we should call "Hive.get(conf)" to get the db connection instead.

          Show
          Zheng Shao added a comment - @hive-584-2009-9-10.patch: Please remove "SessionState.getDb()" - nobody is calling it, and we should call "Hive.get(conf)" to get the db connection instead.
          Hide
          Zheng Shao added a comment -

          @hive-584-2009-9-10.patch:

          Tested. Once the 2 items above are addressed, I will test again and commit.

          Show
          Zheng Shao added a comment - @hive-584-2009-9-10.patch: Tested. Once the 2 items above are addressed, I will test again and commit.
          Hide
          He Yongqiang added a comment -

          Integrate Zheng's comments. Thanks, Zheng.

          Show
          He Yongqiang added a comment - Integrate Zheng's comments. Thanks, Zheng.
          Hide
          Zheng Shao added a comment -

          Committed. Thanks Yongqiang!

          Show
          Zheng Shao added a comment - Committed. Thanks Yongqiang!
          Hide
          Ning Zhang added a comment -

          Just had an offline conversation with Zheng, trying to understand the concepts:

          Life time:

          • Session is the longest living object. It corresponds to the life time from user connecting to Hive using Hive CLI or Hive Server till closing it.
          • The DB connection life time is the time that Hive connects to the metastore. Currently its life time is always the same as that of the session. But going forward, it could be different since a session could be connecting to different metastores (by providing a different HiveConf). This is useful if there are multiple instances of Hadoop clusters, each of which maintain a different Hive DB.
          Show
          Ning Zhang added a comment - Just had an offline conversation with Zheng, trying to understand the concepts: Life time: Session is the longest living object. It corresponds to the life time from user connecting to Hive using Hive CLI or Hive Server till closing it. The DB connection life time is the time that Hive connects to the metastore. Currently its life time is always the same as that of the session. But going forward, it could be different since a session could be connecting to different metastores (by providing a different HiveConf). This is useful if there are multiple instances of Hadoop clusters, each of which maintain a different Hive DB.

            People

            • Assignee:
              Unassigned
              Reporter:
              Zheng Shao
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development