Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-5172

TUGIContainingTransport returning null transport, causing intermittent SocketTimeoutException on hive client and NullPointerException in TUGIBasedProcessor on the server

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • 0.9.0, 0.10.0, 0.11.0
    • None
    • Metastore
    • None

    Description

      We are running into frequent problem using HCatalog 0.4.1 (Hive Metastore Server 0.9) where we get connection reset or connection timeout errors on the client and NullPointerException in TUGIBasedProcessor on the server.

      hive client logs:
      =================
      
      org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
      at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
      at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
      at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
      at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
      at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
      at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
      at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
      at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
      at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
      at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
      at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
      at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
      at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
      at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
      at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
      at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
      at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
      at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
      at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
      at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
      at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
      at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
      at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
      at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
      Caused by: java.net.SocketTimeoutException: Read timed out
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.read(SocketInputStream.java:129)
      at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
      ... 31 more
      
      hive metastore server logs:
      ===============================
      
      2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(182)) - Error occurred during processing of message.
      java.lang.NullPointerException
              at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
              at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
              at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)
      

      Adding some extra debug log messages in TUGIBasedProcessor, noticed that the TUGIContainingTransport is null which results in NullPointerException on the server.

      Further drilling into TUGIContainingTransport noticed that getTransport() returns a null which causes the above
      error.

      Further corelating with GC logs observed that that error always hits when the CMS GC has just kicked in
      (but does not happen after every GC)

      Put some debugging code in TUGIContainingTransport.getTransport() and I tracked it down to

      @Override
      public TUGIContainingTransport getTransport(TTransport trans) {
      
            // UGI information is not available at connection setup time, it will be set later
            // via set_ugi() rpc.
            transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
      
            //return transMap.get(trans); //<-change
      
            TUGIContainingTransport retTrans = transMap.get(trans);
      
            if ( retTrans == null ) {
                   LOGGER.error ("cannot find transport that was in map !!")
      	     return retTrans;
             }  else {
                   LOGGER.debug ("found transport in map")
                   return retTrans;    
             }
      }
      

      When we run this in our test environment, see that we run into the problem just after GC runs,
      and "cannot find transport that was in the map!!" message gets logged.

      Could the GC be collecting entries from transMap, just before the we get it

      Tried a minor change which seems to work

      public TUGIContainingTransport getTransport(TTransport trans) {
      
         TUGIContainingTransport tugiTrans = transMap.get(trans);
      
          if ( tugiTrans == null ) {
            // UGI information is not available at connection setup time, it will be set later
            // via set_ugi() rpc.
             tugiTrans = new TUGIContainingTransport(trans); //get a new TUGIContainingTransport
              transMap.putIfAbsent(trans, tugiTrans);
          }
         return tugiTrans;
      }
      

      My questions for hive and thrift experts

      1.) Do we need to use a ConcurrentMap

      ConcurrentMap<TTransport, TUGIContainingTransport> transMap = new MapMaker().weakKeys().weakValues().makeMap();
      

      It does use == to compare keys (which might be the problem), also in this case we cant rely on the trans to be always there in the transMap, even after a put, so in that case change above probably makes sense

      2.) Is it better idea to use WeakHashMap with WeakReference value instead ? (was looking at org.apache.thrift.transport.TSaslServerTransport, especially change made by THRIFT-1468 as an example)

      e.g.

      private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap = Collections.synchronizedMap(new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());
      

      getTransport() would be something like

      public TUGIContainingTransport getTransport(TTransport trans) {
            WeakReference<TUGIContainingTransport> ret = transMap.get(trans);
            if (ret == null || ret.get() == null) {
              ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));  
              transMap.put(trans, ret); // No need for putIfAbsent().
               // Concurrent calls to getTransport() will pass in different TTransports.
            }
            return ret.get();
      }
      

      3.) Is it absolutely essential that successive calls to get getTransport() need to retrieve the same TUGIContainingTransport object or its ok to create a new one, if its not available in the map ?

      Attachments

        1. HIVE-5172.1.patch.txt
          0.9 kB
          Ashwin Agate

        Issue Links

          Activity

            People

              Unassigned Unassigned
              agateaaa Ashwin Agate
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated: