Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 1.25
PID TTY STAT TIME COMMAND
{{ 1 ? Ssl 2:54 java -jar /tika-server-1.25.jar -h 0.0.0.0 -spawnChild TIKA_CHILD_JVM_OPTS=-JXmx3g -JXX:+ExitOnOutOfMemoryError -status}}
{{ 127 ? Z 1:15 [tesseract] <defunct>}}
{{ 131 ? Z 1:31 [tesseract] <defunct>}}
{{ 139 ? Z 0:16 [tesseract] <defunct>}}
{{ 219 ? Z 40:20 [tesseract] <defunct>}}
{{ 324 ? Z 2:32 [tesseract] <defunct>}}
{{ 342 ? Z 0:09 [tesseract] <defunct>}}
{{ 343 ? Z 0:09 [tesseract] <defunct>}}
{{ 380 ? Z 0:00 [tesseract] <defunct>}}
{{ 430 ? Z 9:18 [tesseract] <defunct>}}
{{ 435 ? Z 4:52 [tesseract] <defunct>}}
{{ 446 ? Z 18:34 [tesseract] <defunct>}}
{{ 453 ? Z 22:33 [tesseract] <defunct>}}
{{ 461 ? Z 0:25 [tesseract] <defunct>}}
{{ 517 ? Z 4:35 [tesseract] <defunct>}}
{{ 526 ? Z 0:04 [tesseract] <defunct>}}
{{ 536 ? S 0:00 36:39}}
{{ 881708 pts/0 Ss 0:00 bash}}
{{ 883782 ? Sl 0:00 java -XX:+ExitOnOutOfMemoryError -Djava.awt.headless=true -cp /tika-server-1.25.jar org.apache.tika.server.TikaServerCli -h 0.0.0.0 TIKA_CHILD_JVM_OPTS=-JXmx3g -status --id ab854166-0a4b-4ecc-93a5-e4504d11e4e8}}
{{ 883801 pts/0 R+ 0:00 ps ax}}
Description
Tika seems to be stuck in a loop trying to bind. Please see the attached logs.
It is also using a lot of CPU while doing that, so in an environment with horizontal scaling it causes other containers to spin up.
This happens from time to time to one of the several containers we run; normally everything runs smoothly.
Not sure why it fails to bind in the first place (it's running in docker).
I suspect it may be related to how quickly it retries the bind.
Ideas on how to debug this issue further are welcome.
Attachments
- bind_issue.txt (621 kB) - Cristian Zamfir
- logs_before.txt (12 kB) - Cristian Zamfir
Activity
Are you using 1.25 or 1.26? Is this fairly rare or easily reproducible?
Sorry, it is 1.25, I selected the wrong version above.
It is not easy to reproduce, but on the other hand it is not super rare. I have seen it twice already over about a seven-day period, with 1-6 containers running (depending on load).
Is there any logging before the infinite loops so we can see what might have brought it down initially?
Is there any chance something like this is happening in your environment: https://github.com/docker/for-linux/issues/1180
Or, I’m wondering if the OS is not cleaning up the port quickly enough, as it sounds like you suggested? At the least, we should put a pause after a failed restart if the port is bound. Ideally, we’d log the process that was binding to the port.
I attached the logs from just before the failure. Not sure if it is a coincidence, but it does happen after a parse timeout hits.
I would say that it is not an issue like the one you mentioned with docker. The container is not restarted, instead the Tika spawned child is restarted due to the timeout.
Actually, it is not clear to me why it is not just the child that restarts after a parse timeout; shouldn't port 9998 still be in use by the parent, is that right?
The port is used only by the child. The parent just monitors the sub process.
To confirm, the event that triggered this chain was a child timeout?
Oops… sorry, I'm on my phone and missed your earlier comment. That'll help when I'm back at a keyboard. Thank you!
A side question: if the parent process only monitors the child, I presume it does not need a lot of memory. Can I assume 200MB is enough for the parent?
A pause after a restart on bind issues looks like a good thing to have in any case. But the child restarts successfully, e.g. on OOM, without hitting this issue.
For a docker setup I can work around it with a liveness check on the /status API endpoint, but it would be nice to also know more about the underlying issue causing this.
Sorry, I see the original logs you attached, but that's where the problem starts, right?
2021-06-09T11:43:22.472293991+02:00 ERROR Could not start Jetty server on port 9,998: Failed to bind to /0.0.0.0:9998
Was there a timeout before: 2021-06-09T11:43:21.647771746+02:00?
I have now attached another file that somehow did not get attached before. The new logs show the first time the bind error happened.
I can't explain why the port was still in use. My guess is that the operating system hadn't yet released it.
I did find a race condition in that the forked process writes "success" to the forking process before the server is successfully launched. I've fixed this in 1.x.
I've added a check for a bind exception, and the forking process will now retry on a bind exception 5 times (waiting 1 second between tries).
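For anyone following along, the retry behavior is roughly the following sketch (this is not the actual Tika code; startServer is a placeholder for whatever call launches the Jetty server):
{code:java}
import java.net.BindException;

public class BindRetrySketch {

    // Placeholder for the call that starts the Jetty server on the given port.
    static void startServer(int port) throws BindException {
        // ... bind to 0.0.0.0:port; throws BindException if the port is still in use ...
    }

    public static void main(String[] args) throws InterruptedException {
        int port = 9998;
        int maxTries = 5;                      // bounded number of retries
        for (int i = 0; i < maxTries; i++) {
            try {
                startServer(port);
                return;                        // bound successfully
            } catch (BindException e) {
                System.err.println("Couldn't bind to " + port + ": " + e.getMessage());
                Thread.sleep(1000);            // wait 1 second between tries
            }
        }
        // Give up instead of spinning forever on the same bind failure.
        throw new RuntimeException("Failed to bind to port " + port + " after " + maxTries + " tries");
    }
}
{code}
The important part is that the loop is bounded; the reported behavior was effectively an immediate, unbounded retry.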
This part of the code is quite different in 2.x, but I'll review that to see if we need to change anything there.
I don't know if you're able to pull a snapshot version of 1.27 and try it out. If you could, that'd be great.
ndipiazza_gmail, have you seen this? You're working with the 2.x branch, right?
I'm wondering if the spawned/abandoned tesseract processes are keeping the port open?
What may be happening is that your process had a TCP port open when it crashed or otherwise exited without explicitly closing it. Normally the OS cleans up these sorts of things, but only when the process record goes away. While the process may not appear to be running any more, there is at least one thing that can keep a record of it around, in order to prevent reuse of its PID. This is the existence of a child process that is not detached from the parent.
https://serverfault.com/questions/181015/how-do-you-free-up-a-port-being-held-open-by-dead-process
No, we have not seen this. We do not have a huge number of people using Tesseract, though.
I have still been primarily working on Tika 1.x.
I could try to reproduce.
zamf, can you clarify the exact steps to reproduce? From the existing details in the ticket I'm having a hard time understanding what your exact steps are.
SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #134 (See https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/134/)
TIKA-3441 – prevent infinite loop on failure to bind to a port (tallison: https://github.com/apache/tika/commit/fd98eeefc968b7e564d5446ae774f7f1f21aed93)
- (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
- (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java
TIKA-3441 – improve likelihood that tesseract processes will be shutdown on crash. (tallison: https://github.com/apache/tika/commit/d7fa2cd284a0d400a1ef29f7111018bb16b1cc5d)
- (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
tallison - unfortunately there is no container built for 1.27 available on hub.docker.com, so it is harder to try the fix out. I will upgrade to 1.26, though.
tallison - I think the defunct tesseract processes could explain it. Since they were lingering there for a while, it looks like the solution is to explicitly kill them.
- I am using the docker image apache/tika:latest-full (side note: this appears to be 1.25 instead of 1.26, but most likely the intention is for latest to point to 1.26).
- The custom parameters (also visible in the Environment field in Jira) are: -spawnChild TIKA_CHILD_JVM_OPTS=-JXmx3g -JXX:+ExitOnOutOfMemoryError -status
- To reproduce, just feed it a lot of files (see the sketch after this list). In this particular case the issue appeared after a few hours, right after a parse timeout error. That timeout caused a restart of the Tika child, which then hit the bind error.
- Tesseract is enabled using the default settings in the apache docker image.
- I mentioned some other environment factors like k8s horizontal scaling; you can scrap that. I am pretty sure this is an issue with the OS keeping the port open, so it can manifest with a single Tika docker container.
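As a rough stand-in for that load, a loop like the following works (a sketch; it assumes tika-server is reachable on localhost:9998 and sample.pdf is any test document, ideally one that triggers OCR):
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class TikaLoadSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Path file = Path.of("sample.pdf");   // any document; image-bearing files exercise tesseract
        for (int i = 0; i < 10_000; i++) {
            // PUT the file to the /rmeta endpoint and ask for JSON metadata back
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9998/rmeta"))
                    .header("Accept", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofFile(file))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(i + " -> HTTP " + response.statusCode());
        }
    }
}
{code}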
Let me see if I can reproduce locally. The changes I made should help some, and we can add a shutdown hook, but I don't think there's a way around an OS's OOM killer, for example. Still, we should do what we can.
I caught another instance of this issue happening and killed this defunct process in the container:
2498 ? Z 0:00 [jspawnhelper] <defunct>
and after that, the child process was able to get past the bind error.
On Ubuntu 20.04.2 LTS, with openjdk 1.8.0_292 and openjdk 14.0.2, and the 1.25 server, the tesseract processes are orphaned, but the restart is able to reclaim the port. Obviously, we want to 1) prevent orphaning tesseract to the best of our ability and 2) prevent infinite loops when the port can't be bound, but I'm wondering if we should also experiment with different OS images for our docker container so that orphaned processes don't prevent port binding.
It does not reproduce that easily. Among 6 containers each having several tens of restarts per day, I only saw this issue 3 times.
I've added shutdown hooks to TesseractOCRParser so that if the jvm shuts itself down, that should prevent orphaned tesseract processes.
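The general shape of that change is something like the following (a simplified sketch of the idea, not the actual TesseractOCRParser code):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

public class ExternalProcessReaperSketch {

    // Track every external process (e.g. tesseract) that gets spawned.
    private static final Map<Long, Process> RUNNING = new ConcurrentHashMap<>();

    static {
        // On normal JVM shutdown, kill anything still running so it is not
        // left behind as an orphaned process.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            for (Process p : RUNNING.values()) {
                p.destroyForcibly();
            }
        }));
    }

    static Process register(Process p) {
        RUNNING.put(p.pid(), p);
        return p;
    }

    static void deregister(Process p) {
        RUNNING.remove(p.pid());
    }

    public static void main(String[] args) throws Exception {
        // Requires tesseract on the PATH; any external command works for the demo.
        Process p = register(new ProcessBuilder("tesseract", "--version").start());
        try {
            p.waitFor(5, TimeUnit.SECONDS);
        } finally {
            p.destroyForcibly();
            deregister(p);
        }
    }
}
{code}
A hook like this only helps when the JVM actually runs its shutdown hooks, which is why kill -9 and -XX:+ExitOnOutOfMemoryError (see below) still leave orphans.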
I was still able to orphan tesseract processes if I kill -9'd the forked process. I'm not sure how to handle that.
I was not able to reproduce the infinite loop, but, as you said that should be rare.
As mentioned above, I added code to prevent infinite loops on a failed restart: if there is a failure to bind within 10 seconds (should we make that configurable?), tika-server won't restart.
I added a waitFor(10 seconds) in the forking process before forcibly destroying the child process. This should increase the chances that the child process terminates itself and thereby cleans up orphanable tesseract processes.
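In ProcessBuilder terms, that is roughly (a sketch, not the actual TikaServerWatchDog code):
{code:java}
import java.util.concurrent.TimeUnit;

public class GracefulStopSketch {
    // Ask the child to exit, give it time to run its own shutdown hooks
    // (which clean up tesseract children), and only then force-kill it.
    static void stopChild(Process child) throws InterruptedException {
        child.destroy();                              // polite termination request
        if (!child.waitFor(10, TimeUnit.SECONDS)) {   // wait up to 10 seconds
            child.destroyForcibly();                  // then kill it
        }
        child.waitFor();                              // reap it so it doesn't linger as <defunct>
    }

    public static void main(String[] args) throws Exception {
        // Demo with a placeholder long-running child (assumes a Unix-like "sleep").
        Process child = new ProcessBuilder("sleep", "60").start();
        stopChild(child);
    }
}
{code}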
I noticed there was resource leakage in the child process's shutdown hooks. With each restart, another shutdown hook was added. The code was intended to remove the previous hook and then add a new one. That was not working, which is why there were 20464 thread objects in the shutdown hook map in the attached logs. This is working now.
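The intended remove-then-add logic is essentially (a simplified sketch):
{code:java}
public class ShutdownHookSwapSketch {

    private static Thread currentHook;

    // Replace the previously registered hook instead of accumulating one per restart.
    static synchronized void resetShutdownHook(Runnable cleanup) {
        if (currentHook != null) {
            // removeShutdownHook needs the exact Thread instance that was registered
            Runtime.getRuntime().removeShutdownHook(currentHook);
        }
        currentHook = new Thread(cleanup);
        Runtime.getRuntime().addShutdownHook(currentHook);
    }

    public static void main(String[] args) {
        resetShutdownHook(() -> System.out.println("cleanup v1"));
        resetShutdownHook(() -> System.out.println("cleanup v2")); // v1 is removed; only v2 runs at exit
    }
}
{code}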
After the addition of the shutdown hooks in TesseractOCRParser (e.g. when you migrate to 1.27), you should NOT use -JXX:+ExitOnOutOfMemoryError, because this shuts down the JVM without calling the shutdown hooks, and tesseract will be orphaned.
SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #135 (See https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/135/)
TIKA-3441 – improve likelihood that tesseract processes will be shutdown on jvm restart. (tallison: https://github.com/apache/tika/commit/1224f881a1dc54282281cdf2fc9ecf3f3e429393)
- (edit) CHANGES.txt
- (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
- (add) tika-core/src/main/java/org/apache/tika/parser/AbstractExternalProcessParser.java
- (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
- (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java
With the exception of this gem with Java 14:
WARN [qtp1256893889-32] 12:27:19,572 org.eclipse.jetty.server.HttpChannel /rmeta
java.lang.NoClassDefFoundError: Could not initialize class java.io.DeleteOnExitHook
  at java.io.File.deleteOnExit(File.java:1094) ~[?:?]
  at org.apache.cxf.helpers.FileUtils.createTmpDir(FileUtils.java:161) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.helpers.FileUtils.getDefaultTempDir(FileUtils.java:88) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.helpers.FileUtils.createTempFile(FileUtils.java:269) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.helpers.FileUtils.createTempFile(FileUtils.java:263) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.io.CachedOutputStream.createFileOutputStream(CachedOutputStream.java:475) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.io.CachedOutputStream.enforceLimits(CachedOutputStream.java:437) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.io.CachedOutputStream.write(CachedOutputStream.java:445) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.helpers.IOUtils.copy(IOUtils.java:193) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.helpers.IOUtils.copy(IOUtils.java:146) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.io.DelegatingInputStream.cacheInput(DelegatingInputStream.java:54) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.http.AbstractHTTPDestination$1.cacheInput(AbstractHTTPDestination.java:321) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.http.AbstractHTTPDestination.cacheInput(AbstractHTTPDestination.java:600) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.http.AbstractHTTPDestination.flushHeaders(AbstractHTTPDestination.java:622) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.flushHeaders(JettyHTTPDestination.java:272) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.http.AbstractHTTPDestination$WrappedOutputStream.close(AbstractHTTPDestination.java:784) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.http.AbstractHTTPDestination$BackChannelConduit.close(AbstractHTTPDestination.java:722) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:63) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135) ~[tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882) [tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036) [tika-server-standard-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
  at java.lang.Thread.run(Thread.java:832) [?:?]
SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #258 (See https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/258/)
TIKA-3441 – increase chances that TesseractOCRParser will not orphan tesseract process and ensure that tika-server never enters an infinite loop on a bind exception. (tallison: https://github.com/apache/tika/commit/51fb33f3067bda551905056110bcf251d4bb3d23)
- (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
- (edit) tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerWatchDog.java
- (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesClient.java
- (add) tika-core/src/main/java/org/apache/tika/parser/AbstractExternalProcessParser.java
- (edit) tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java