Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 1.15.0
Description
We have a test for Geode on Kubernetes which:
- Deploys a Geode cluster consisting of 2 locator Pods, 3 server Pods
- Deploys 5 Spring boot client Pods which continually do PUTs and GETs
- Triggers a rolling restart of the locator Pods
- The rolling restart operation restarts one locator at a time, waiting for each restarted locator to become fully online before restarting the next locator
- Stops the client operations and validates there were no exceptions thrown in the clients.
Occasionally, we see NoAvailableLocatorsException thrown on one of the clients:
org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [system-test-gemfire-locator-0.system-test-gemfire-locator.gemfire-system-test-3f1ecc74-b1ea-4288-b4d1-594bbb8364ab.svc.cluster.local:10334, system-test-gemfire-locator-1.system-test-gemfire-locator.gemfire-system-test-3f1ecc74-b1ea-4288-b4d1-594bbb8364ab.svc.cluster.local:10334]
    at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServer(AutoConnectionSourceImpl.java:174)
    at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:198)
    at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:196)
    at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:190)
    at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.borrowConnection(ConnectionManagerImpl.java:276)
    at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:136)
    at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:119)
    at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:801)
    at org.apache.geode.cache.client.internal.GetOp.execute(GetOp.java:92)
    at org.apache.geode.cache.client.internal.ServerRegionProxy.get(ServerRegionProxy.java:114)
    at org.apache.geode.internal.cache.LocalRegion.findObjectInSystem(LocalRegion.java:2802)
    at org.apache.geode.internal.cache.LocalRegion.getObject(LocalRegion.java:1469)
    at org.apache.geode.internal.cache.LocalRegion.nonTxnFindObject(LocalRegion.java:1442)
    at org.apache.geode.internal.cache.LocalRegionDataView.findObject(LocalRegionDataView.java:197)
    at org.apache.geode.internal.cache.LocalRegion.get(LocalRegion.java:1379)
    at org.apache.geode.internal.cache.LocalRegion.get(LocalRegion.java:1318)
    at org.apache.geode.internal.cache.LocalRegion.get(LocalRegion.java:1303)
    at org.apache.geode.internal.cache.AbstractRegion.get(AbstractRegion.java:439)
    at org.apache.geode.kubernetes.client.service.AsyncOperationService.evaluate(AsyncOperationService.java:282)
    at org.apache.geode.kubernetes.client.api.Controller.evaluateRegion(Controller.java:88)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:197)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:141)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:106)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:894)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:808)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1063)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:963)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:626)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:733)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:227)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:542)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:143)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78)
    at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:764)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:357)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:374)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:893)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1707)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.base/java.lang.Thread.run(Thread.java:829)
We do not expect any of the clients to throw NoAvailableLocatorsException because there is always at least one locator available during the test.
We did some investigation and found that:
- Locator Pods get different IP addresses on Kubernetes after they are restarted, but they keep the same hostname.
- After a client throws NoAvailableLocatorsException, it continues trying to contact the locators using stale IP addresses (i.e. the locators' original IP addresses from before they were restarted). We checked that the locators' DNS names resolve to the correct IP addresses from within the locator containers. We also ruled out the JVM DNS cache settings as the cause of the stale IP addresses.
- The changes for GEODE-9139 changed the behavior of org.apache.geode.distributed.internal.tcpserver.HostAndPort to permanently cache the resolved address after its first resolution attempt. This undoes part of the fix introduced by GEODE-7808, in which HostAndPort was created as a way to hold an unresolved hostname.
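For anyone repeating the JVM DNS cache check mentioned above, one possible approach is to print the relevant security properties from inside the client JVM. This is only a sketch: the class name DnsCacheSettings is hypothetical, and the report does not say how the cache settings were actually ruled out.

```java
import java.security.Security;

/** Hypothetical helper, not part of Geode: reports the JVM's DNS cache security properties. */
public final class DnsCacheSettings {

  /** Returns the value of a security property, or a marker string when it is not set. */
  public static String describe(String property) {
    String value = Security.getProperty(property);
    return value == null ? "(unset, JVM default applies)" : value;
  }

  public static void main(String[] args) {
    // TTL in seconds for successful lookups; -1 means cache forever.
    System.out.println("networkaddress.cache.ttl = " + describe("networkaddress.cache.ttl"));
    // TTL in seconds for failed lookups.
    System.out.println(
        "networkaddress.cache.negative.ttl = " + describe("networkaddress.cache.negative.ttl"));
  }
}
```

If these values are small and the client still uses the old addresses, the stale address is being held somewhere above the JVM resolver, which is consistent with the HostAndPort caching described in the last bullet.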
In order to fix this issue, it seems like org.apache.geode.distributed.internal.tcpserver.HostAndPort should be changed so that when it contains an unresolved address, it will try to resolve the address each time getSocketInetAddress is called. This was the behavior in Geode 1.13 and 1.14, so changing it back shouldn't have a negative impact on performance.
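The proposed behavior can be sketched roughly as follows. This is a simplified, hypothetical stand-in (ReResolvingHostAndPort), not the actual Geode HostAndPort source; it only assumes the standard java.net.InetSocketAddress API, where createUnresolved stores a hostname without a lookup and the two-argument constructor attempts resolution.

```java
import java.net.InetSocketAddress;

/** Hypothetical stand-in for HostAndPort; illustrates re-resolving on every call. */
public final class ReResolvingHostAndPort {
  // Holds only the hostname and port; createUnresolved performs no DNS lookup.
  private final InetSocketAddress address;

  public ReResolvingHostAndPort(String host, int port) {
    this.address = InetSocketAddress.createUnresolved(host, port);
  }

  /**
   * Returns a freshly constructed address on each call while the stored one is unresolved,
   * so a hostname whose IP changed (e.g. a restarted Pod) is looked up again.
   * The lookup is still subject to the JVM's own DNS cache TTL.
   */
  public InetSocketAddress getSocketInetAddress() {
    if (address.isUnresolved()) {
      // The two-argument constructor attempts to resolve the hostname.
      return new InetSocketAddress(address.getHostString(), address.getPort());
    }
    return address;
  }
}
```

Because nothing resolved is ever stored in the field, a locator that comes back with a new IP address is picked up on the next call instead of being pinned to its first resolution.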