Description
The constructor of the class o.a.n.protocol.Content instantiates a new MimeUtil object. This is not cheap: it always creates a new Tika object, and there is a lock on the job/jar file when config files are read:
"FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f70523c3800 nid=0x1de2 waiting for monitor entry [0x00007f70193a8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
        - waiting to lock <0x00000005e0285758> (a java.util.jar.JarFile)
        at java.util.jar.JarFile.getEntry(JarFile.java:240)
        at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
        at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
        at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
        at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
        at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
        at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
        at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
        at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
        at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
        at java.util.Collections.list(Collections.java:5239)
        at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
        at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
        at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
        at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
        at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
        at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
        at org.apache.tika.Tika.<init>(Tika.java:116)
        at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
        at org.apache.nutch.protocol.Content.<init>(Content.java:83)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
With many Fetcher threads this can become a significant bottleneck. Running a Fetcher with 120 threads, I've found up to 50 threads waiting for this lock:
# pid 7195 is a Fetcher map task
% sudo -u yarn jstack 7195 \
    | grep -A25 'waiting to lock' \
    | grep -F 'org.apache.tika.Tika.<init>' \
    | wc -l
49
As MimeUtil is thread-safe, including the Tika detector it calls, the best solution seems to be to cache the MimeUtil object in the actual protocol implementation, as is done in Nutch 2.x (lib-http HttpBase, line #151).
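A minimal sketch of the caching idea, not the actual Nutch code: the classes ExpensiveDetector, Content and Protocol below are hypothetical stand-ins for MimeUtil, o.a.n.protocol.Content and a protocol implementation such as HttpBase, and the constructor that accepts a shared detector is an assumption made for illustration. The point is only that the costly object is built once per protocol instance and reused by all fetcher threads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MimeUtilCacheSketch {

  /** Hypothetical stand-in for MimeUtil: costly to construct, thread-safe to use. */
  static final class ExpensiveDetector {
    static final AtomicInteger INSTANCES = new AtomicInteger();

    ExpensiveDetector() {
      // In Nutch this is where the Tika object would be created.
      INSTANCES.incrementAndGet();
    }

    String detect(String url) {
      return url.endsWith(".html") ? "text/html" : "application/octet-stream";
    }
  }

  /** Hypothetical stand-in for Content: it is handed a detector instead of creating one. */
  static final class Content {
    final String contentType;

    Content(String url, ExpensiveDetector detector) {
      this.contentType = detector.detect(url);
    }
  }

  /** Hypothetical stand-in for a protocol implementation that caches the detector. */
  static final class Protocol {
    private final ExpensiveDetector detector = new ExpensiveDetector();

    Content fetch(String url) {
      return new Content(url, detector);
    }
  }

  /** Runs the given number of fetcher-like threads and returns how many detectors were built. */
  static int run(int threads, int fetchesPerThread) {
    ExpensiveDetector.INSTANCES.set(0);
    Protocol protocol = new Protocol();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.submit(() -> {
        for (int i = 0; i < fetchesPerThread; i++) {
          protocol.fetch("http://example.org/page" + i + ".html");
        }
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(30, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for fetches", e);
    }
    return ExpensiveDetector.INSTANCES.get();
  }

  public static void main(String[] args) {
    System.out.println("detectors created: " + run(8, 100));
  }
}
```

With the detector held by the protocol object, the count stays at one regardless of the number of threads or fetches, whereas constructing it inside Content would create one per fetched document.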
Issue Links
- relates to TIKA-2645 Reuse SAXParsers where possible (Resolved)