Hive
  1. Hive
  2. HIVE-1482

Not all jdbc calls are threadsafe.

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.7.0
    • Fix Version/s: None
    • Component/s: JDBC
    • Labels:
      None
    1. HIVE-1482-1.patch
      10 kB
      Bennie Schut
    2. HIVE-1482-2.patch
      10 kB
      Bennie Schut

      Activity

      Hide
      Bennie Schut added a comment -

      On HiveConnection we reuse the "client" on many calls but we've already seen the "DatabaseMetaData getMetaData()" call being used in a multithreaded way.

      Show
      Bennie Schut added a comment - On HiveConnection we reuse the "client" on many calls but we've already seen the "DatabaseMetaData getMetaData()" call being used in a multithreaded way.
      Hide
      Bennie Schut added a comment -

      Ok I guess the first questions would be:

      Do we want to go for as much concurrent work as possible or are we ok with using some synchronize on the client call's? I would say sync. is prefered over creating new connections in this case right?

      Do we perhaps want to solve this on the "ThriftHive" level since that's the part which doesn't allow the multi threaded calls or perhaps more on the "HiveDatabaseMetaData" since that's the part which has the requirement of multi thread safety?

      Any idea's are welcome on this.

      Show
      Bennie Schut added a comment - Ok I guess the first questions would be: Do we want to go for as much concurrent work as possible or are we ok with using some synchronize on the client call's? I would say sync. is prefered over creating new connections in this case right? Do we perhaps want to solve this on the "ThriftHive" level since that's the part which doesn't allow the multi threaded calls or perhaps more on the "HiveDatabaseMetaData" since that's the part which has the requirement of multi thread safety? Any idea's are welcome on this.
      Hide
      John Sichi added a comment -

      Yes, synchronized is the way to go.

      I think the synchronization has to be at the connection level. For example, HiveStatement also needs to make calls on the thrift interface. It's not just DatabaseMetaData.

      So we should add a new data member to HiveConnection:

      Object connectionMutex = new Object();

      Then pass connectionMutex to constructors of sub-objects which need to participate in synchronization. They can then do

      synchronized(connectionMutex)

      { ... }

      around their critical sections.

      Creating a separate object for this purpose allows us to keep control over synchronization (e.g. so it doesn't get mixed up with user-level or thrift-level synchronization code later). We'll also need to be able to skip synchronization in the case of asynchronous cancel, but that's a separate task.

      We should also review to see if there is any client-side state which needs protection.

      Show
      John Sichi added a comment - Yes, synchronized is the way to go. I think the synchronization has to be at the connection level. For example, HiveStatement also needs to make calls on the thrift interface. It's not just DatabaseMetaData. So we should add a new data member to HiveConnection: Object connectionMutex = new Object(); Then pass connectionMutex to constructors of sub-objects which need to participate in synchronization. They can then do synchronized(connectionMutex) { ... } around their critical sections. Creating a separate object for this purpose allows us to keep control over synchronization (e.g. so it doesn't get mixed up with user-level or thrift-level synchronization code later). We'll also need to be able to skip synchronization in the case of asynchronous cancel, but that's a separate task. We should also review to see if there is any client-side state which needs protection.
      Hide
      Bennie Schut added a comment -

      Ok I think I covered the sync stuff like you sugested.
      For the state part I was looking at how SessionState is used within the HiveConnection.
      There are several spots where we pass a sessionstate object (HivePreparedStatement, HiveStatement) but both places end up not actually using the object. Simplest thing would be not to pass them. But even if we do pass it I think it only contains session data and not query specific data.
      Query specific things like column names/types etc. can be done nicely within a sync block together with the client call for the query itself so that shouldn't be a problem.

      I'll upload a patch with what I have soon.

      Show
      Bennie Schut added a comment - Ok I think I covered the sync stuff like you sugested. For the state part I was looking at how SessionState is used within the HiveConnection. There are several spots where we pass a sessionstate object (HivePreparedStatement, HiveStatement) but both places end up not actually using the object. Simplest thing would be not to pass them. But even if we do pass it I think it only contains session data and not query specific data. Query specific things like column names/types etc. can be done nicely within a sync block together with the client call for the query itself so that shouldn't be a problem. I'll upload a patch with what I have soon.
      Hide
      Bennie Schut added a comment -

      Full test ran with success on my pc. Plus no additional check-style errors found.
      Anything else we can think of for this?

      Show
      Bennie Schut added a comment - Full test ran with success on my pc. Plus no additional check-style errors found. Anything else we can think of for this?
      Hide
      John Sichi added a comment -

      After looking at the code, I'm thinking we could use Java proxies to make the synchronization automatic and bulletproof.

      http://stackoverflow.com/questions/743288/java-synchronization-utility

      Proxy the HiveInterface, and pass the proxy instance down. That way we don't even need to pass around the connectionMutex (it will be hidden inside the proxy).

      What do you think?

      Show
      John Sichi added a comment - After looking at the code, I'm thinking we could use Java proxies to make the synchronization automatic and bulletproof. http://stackoverflow.com/questions/743288/java-synchronization-utility Proxy the HiveInterface, and pass the proxy instance down. That way we don't even need to pass around the connectionMutex (it will be hidden inside the proxy). What do you think?
      Hide
      John Sichi added a comment -

      Also, can you create a followup for adding a multithreaded test running JDBC against a thrift server?

      Show
      John Sichi added a comment - Also, can you create a followup for adding a multithreaded test running JDBC against a thrift server?
      Hide
      Bennie Schut added a comment -

      The proxy sounds like an interesting approach. It will probably synchronize a little more then we need but it's a lot safer and the normal use case is calling the methods sequentially anyway so not much loss there.

      Doing a little mutithreading test is a good idea. I've had some HIVE_PLAN exceptions in the past (different jira) which would probably also become visible then. I'll have a look.

      Thanks.

      Show
      Bennie Schut added a comment - The proxy sounds like an interesting approach. It will probably synchronize a little more then we need but it's a lot safer and the normal use case is calling the methods sequentially anyway so not much loss there. Doing a little mutithreading test is a good idea. I've had some HIVE_PLAN exceptions in the past (different jira) which would probably also become visible then. I'll have a look. Thanks.
      Hide
      Bennie Schut added a comment -

      I've tried but this is actually surprisingly difficult to reproduce failure in a test on TestJdbcDriver. Perhaps there is something synchronized about the use of the embedded mode the test is running in?

      Show
      Bennie Schut added a comment - I've tried but this is actually surprisingly difficult to reproduce failure in a test on TestJdbcDriver. Perhaps there is something synchronized about the use of the embedded mode the test is running in?
      Hide
      John Sichi added a comment -

      Yeah, the problems may only show up with remoting (which is why I mentioned having the test run against a thrift server instead of embedded).

      Show
      John Sichi added a comment - Yeah, the problems may only show up with remoting (which is why I mentioned having the test run against a thrift server instead of embedded).
      Hide
      Bennie Schut added a comment -

      Still need to reproduce the problem in a test. So this is just to show some progress. It's now using generics with a SynchronizedFactory.

      Show
      Bennie Schut added a comment - Still need to reproduce the problem in a test. So this is just to show some progress. It's now using generics with a SynchronizedFactory.
      Hide
      John Sichi added a comment -

      In HIVE-1899, I added a special-case sync proxy, but it would be better to use the generic version Bennie has here. If someone works on this one more, maybe update HiveMetaStoreClient.newSynchronizedClient as well. (But note that my special-case needs a global lock, not a per-object lock.)

      Show
      John Sichi added a comment - In HIVE-1899 , I added a special-case sync proxy, but it would be better to use the generic version Bennie has here. If someone works on this one more, maybe update HiveMetaStoreClient.newSynchronizedClient as well. (But note that my special-case needs a global lock, not a per-object lock.)

        People

        • Assignee:
          Bennie Schut
          Reporter:
          Bennie Schut
        • Votes:
          1 Vote for this issue
          Watchers:
          5 Start watching this issue

          Dates

          • Created:
            Updated:

            Development