Hive
  1. Hive
  2. HIVE-5172

TUGIContainingTransport returning null transport, causing intermittent SocketTimeoutException on hive client and NullPointerException in TUGIBasedProcessor on the server

    Details

    • Type: Bug Bug
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0, 0.10.0, 0.11.0
    • Fix Version/s: None
    • Component/s: Metastore
    • Labels:
      None

      Description

      We are running into frequent problem using HCatalog 0.4.1 (Hive Metastore Server 0.9) where we get connection reset or connection timeout errors on the client and NullPointerException in TUGIBasedProcessor on the server.

      
      hive client logs:
      =================
      
      org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
      at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
      at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
      at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
      at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
      at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
      at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
      at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
      at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
      at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
      at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
      at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
      at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
      at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
      at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
      at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
      at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
      at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
      at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
      at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
      at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
      at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
      at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
      at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
      at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
      Caused by: java.net.SocketTimeoutException: Read timed out
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.read(SocketInputStream.java:129)
      at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
      ... 31 more
      
      
      hive metastore server logs:
      ===============================
      
      2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(182)) - Error occurred during processing of message.
      java.lang.NullPointerException
              at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
              at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
              at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)
      

      Adding some extra debug log messages in TUGIBasedProcessor, noticed that the TUGIContainingTransport is null which results in NullPointerException on the server.

      Further drilling into TUGIContainingTransport noticed that getTransport() returns a null which causes the above
      error.

      Further corelating with GC logs observed that that error always hits when the CMS GC has just kicked in
      (but does not happen after every GC)

      Put some debugging code in TUGIContainingTransport.getTransport() and I tracked it down to

      @Override
      public TUGIContainingTransport getTransport(TTransport trans) {
      
            // UGI information is not available at connection setup time, it will be set later
            // via set_ugi() rpc.
            transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
      
            //return transMap.get(trans); //<-change
      
            TUGIContainingTransport retTrans = transMap.get(trans);
      
            if ( retTrans == null ) {
                   LOGGER.error ("cannot find transport that was in map !!")
      	     return retTrans;
             }  else {
                   LOGGER.debug ("found transport in map")
                   return retTrans;    
             }
      }
      

      When we run this in our test environment, see that we run into the problem just after GC runs,
      and "cannot find transport that was in the map!!" message gets logged.

      Could the GC be collecting entries from transMap, just before the we get it

      Tried a minor change which seems to work

      public TUGIContainingTransport getTransport(TTransport trans) {
      
         TUGIContainingTransport tugiTrans = transMap.get(trans);
      
          if ( tugiTrans == null ) {
            // UGI information is not available at connection setup time, it will be set later
            // via set_ugi() rpc.
             tugiTrans = new TUGIContainingTransport(trans); //get a new TUGIContainingTransport
              transMap.putIfAbsent(trans, tugiTrans);
          }
         return tugiTrans;
      }
      

      My questions for hive and thrift experts

      1.) Do we need to use a ConcurrentMap

      ConcurrentMap<TTransport, TUGIContainingTransport> transMap = new MapMaker().weakKeys().weakValues().makeMap();
      

      It does use == to compare keys (which might be the problem), also in this case we cant rely on the trans to be always there in the transMap, even after a put, so in that case change above probably makes sense

      2.) Is it better idea to use WeakHashMap with WeakReference value instead ? (was looking at org.apache.thrift.transport.TSaslServerTransport, especially change made by THRIFT-1468 as an example)

      e.g.

      private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap = Collections.synchronizedMap(new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());
      

      getTransport() would be something like

      public TUGIContainingTransport getTransport(TTransport trans) {
            WeakReference<TUGIContainingTransport> ret = transMap.get(trans);
            if (ret == null || ret.get() == null) {
              ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));  
              transMap.put(trans, ret); // No need for putIfAbsent().
               // Concurrent calls to getTransport() will pass in different TTransports.
            }
            return ret.get();
      }
      

      3.) Is it absolutely essential that successive calls to get getTransport() need to retrieve the same TUGIContainingTransport object or its ok to create a new one, if its not available in the map ?

        Issue Links

          Activity

          Hide
          agate added a comment -

          Uploading a patch which creates a new TUGIContainingTransport and adds it to the map, and returns the newly created TUGIContainingTransport

          Show
          agate added a comment - Uploading a patch which creates a new TUGIContainingTransport and adds it to the map, and returns the newly created TUGIContainingTransport
          Hide
          agate added a comment -

          Please review this patch. Its a minor change as noted in description above. Checks transMap for cached TUGIContainingTransport, creates a new object and adds it cache, and returns it (doesnt look in the cache after put, where the possibility is that it wont be there if the entry has been garbage collected).

          Show
          agate added a comment - Please review this patch. Its a minor change as noted in description above. Checks transMap for cached TUGIContainingTransport, creates a new object and adds it cache, and returns it (doesnt look in the cache after put, where the possibility is that it wont be there if the entry has been garbage collected).
          Hide
          Hive QA added a comment -

          Overall: +1 all checks pass

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12600723/HIVE-5172.1.patch.txt

          SUCCESS: +1 2902 tests passed

          Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/578/testReport
          Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/578/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          

          This message is automatically generated.

          Show
          Hive QA added a comment - Overall : +1 all checks pass Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12600723/HIVE-5172.1.patch.txt SUCCESS: +1 2902 tests passed Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/578/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/578/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase This message is automatically generated.
          Hide
          Navis added a comment -

          +1 for the patch, but I cannot reply any of questions. Ashutosh Chauhan Could you check this?

          Show
          Navis added a comment - +1 for the patch, but I cannot reply any of questions. Ashutosh Chauhan Could you check this?
          Hide
          Ashutosh Chauhan added a comment -

          As I mentioned in https://issues.apache.org/jira/browse/HIVE-3805?focusedCommentId=13533106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13533106 this TUGIContainingTransport is really a hack, current way to do this is to use Plain Sasl Server otherwise we keep running into problems like this. agate wondering if you would like to pursue the 'proper fix'?
          If not, than I need to think a bit for this current patch. Will get back to you soon.

          Show
          Ashutosh Chauhan added a comment - As I mentioned in https://issues.apache.org/jira/browse/HIVE-3805?focusedCommentId=13533106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13533106 this TUGIContainingTransport is really a hack, current way to do this is to use Plain Sasl Server otherwise we keep running into problems like this. agate wondering if you would like to pursue the 'proper fix'? If not, than I need to think a bit for this current patch. Will get back to you soon.
          Hide
          agate added a comment -

          ~ashutoshc, We have a patched version currently running in our test environment and havent run into any other problems for more than two weeks. I agree having a elegant solution is probably warranted but probably should be done in a different JIRA (just my opinion). This problem was a showstopper P1 issue for us and had to spend weeks to track it down.

          Show
          agate added a comment - ~ashutoshc, We have a patched version currently running in our test environment and havent run into any other problems for more than two weeks. I agree having a elegant solution is probably warranted but probably should be done in a different JIRA (just my opinion). This problem was a showstopper P1 issue for us and had to spend weeks to track it down.
          Hide
          Ashutosh Chauhan added a comment -

          Though fix seems harmless, I think it doesn't eliminate root cause. Under intense GC pressure, same problem may occur even with this fix. I am fine putting this in, but just be aware this fix has not solved the root cause.

          Show
          Ashutosh Chauhan added a comment - Though fix seems harmless, I think it doesn't eliminate root cause. Under intense GC pressure, same problem may occur even with this fix. I am fine putting this in, but just be aware this fix has not solved the root cause.
          Hide
          agate added a comment -

          Thanks Ashutosh Chauhan and Navis for taking the time to look at this patch. Can we put this patch in ? Having this patch in will be helpful for lot of users as its better than the way it is right now (which is not usable at all for production use under any circumstance)

          Show
          agate added a comment - Thanks Ashutosh Chauhan and Navis for taking the time to look at this patch. Can we put this patch in ? Having this patch in will be helpful for lot of users as its better than the way it is right now (which is not usable at all for production use under any circumstance)
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-hadoop1-ptest #210 (See https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/210/)
          HIVE-5597 : Temporary fix for HIVE-5172 (possible NPE in TUGIContainingTransport) (agate via Navis) (navis: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534524)

          • /hive/trunk/shims/src/common/java/org/apache/hadoop/hive/thrift/TUGIContainingTransport.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hive-trunk-hadoop1-ptest #210 (See https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/210/ ) HIVE-5597 : Temporary fix for HIVE-5172 (possible NPE in TUGIContainingTransport) (agate via Navis) (navis: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534524 ) /hive/trunk/shims/src/common/java/org/apache/hadoop/hive/thrift/TUGIContainingTransport.java
          Hide
          Navis added a comment -

          As mentioned above by Ashutosh, current patch is not complete fix for this problem. So I've made separate issue and committed to trunk for a temporal mending.

          Show
          Navis added a comment - As mentioned above by Ashutosh, current patch is not complete fix for this problem. So I've made separate issue and committed to trunk for a temporal mending.
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-hadoop2-ptest #148 (See https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/148/)
          HIVE-5597 : Temporary fix for HIVE-5172 (possible NPE in TUGIContainingTransport) (agate via Navis) (navis: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534524)

          • /hive/trunk/shims/src/common/java/org/apache/hadoop/hive/thrift/TUGIContainingTransport.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hive-trunk-hadoop2-ptest #148 (See https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/148/ ) HIVE-5597 : Temporary fix for HIVE-5172 (possible NPE in TUGIContainingTransport) (agate via Navis) (navis: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534524 ) /hive/trunk/shims/src/common/java/org/apache/hadoop/hive/thrift/TUGIContainingTransport.java
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-h0.21 #2414 (See https://builds.apache.org/job/Hive-trunk-h0.21/2414/)
          HIVE-5597 : Temporary fix for HIVE-5172 (possible NPE in TUGIContainingTransport) (agate via Navis) (navis: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534524)

          • /hive/trunk/shims/src/common/java/org/apache/hadoop/hive/thrift/TUGIContainingTransport.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hive-trunk-h0.21 #2414 (See https://builds.apache.org/job/Hive-trunk-h0.21/2414/ ) HIVE-5597 : Temporary fix for HIVE-5172 (possible NPE in TUGIContainingTransport) (agate via Navis) (navis: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534524 ) /hive/trunk/shims/src/common/java/org/apache/hadoop/hive/thrift/TUGIContainingTransport.java
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-hadoop2 #516 (See https://builds.apache.org/job/Hive-trunk-hadoop2/516/)
          HIVE-5597 : Temporary fix for HIVE-5172 (possible NPE in TUGIContainingTransport) (agate via Navis) (navis: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534524)

          • /hive/trunk/shims/src/common/java/org/apache/hadoop/hive/thrift/TUGIContainingTransport.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hive-trunk-hadoop2 #516 (See https://builds.apache.org/job/Hive-trunk-hadoop2/516/ ) HIVE-5597 : Temporary fix for HIVE-5172 (possible NPE in TUGIContainingTransport) (agate via Navis) (navis: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534524 ) /hive/trunk/shims/src/common/java/org/apache/hadoop/hive/thrift/TUGIContainingTransport.java

            People

            • Assignee:
              Unassigned
              Reporter:
              agate
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:

                Development