TIKA-1568

AutoDetectReader performance problem

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7
    • Fix Version/s: 1.22
    • Component/s: None
    • Labels: None

      Description

      Parsing many text files is slow because every parse repeats the call to ServiceLoader.loadServiceProviders(EncodingDetector.class), which searches the class loader's resources for service files again each time. This happens in TXTParser, HTMLParser and SourceCodeParser; in the trace below the call comes from the AutoDetectReader constructor. In most cases, when Tika is using the default ServiceLoader instance created in the parser's static section, this cost can be avoided by caching the resulting List<EncodingDetector> at a higher level, as a static property of the parser. With custom ServiceLoaders, the same can be achieved either by putting the list in the ParseContext or by caching these lists at a lower level, inside the ServiceLoader component.

      The relevant part of the stack trace follows:

         java.lang.Thread.State: BLOCKED (on object monitor)
      	at java.util.zip.ZipFile.getEntry(ZipFile.java:304)
      	- locked <0x00000007909d2e48> (a java.util.jar.JarFile)
      	at java.util.jar.JarFile.getEntry(JarFile.java:227)
      	at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
      	at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
      	at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
      	at sun.misc.URLClassPath$1.next(URLClassPath.java:226)
      	at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:236)
      	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:583)
      	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:581)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at java.net.URLClassLoader$3.next(URLClassLoader.java:580)
      	at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:605)
      	at java.util.Collections.list(Collections.java:3687)
      	at org.eclipse.jetty.webapp.WebAppClassLoader.toList(WebAppClassLoader.java:337)
      	at org.eclipse.jetty.webapp.WebAppClassLoader.getResources(WebAppClassLoader.java:321)
      	at org.apache.tika.config.ServiceLoader.findServiceResources(ServiceLoader.java:210)
      	at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:277)
      	at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:306)
      	at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:228)
      	at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:104)
      	at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:70)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
      	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
      ...
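
A minimal sketch of the caching idea from the description, not Tika's actual fix. CachedProviders is a hypothetical helper, and the Supplier stands in for the expensive ServiceLoader.loadServiceProviders(EncodingDetector.class) call so that the example runs without Tika on the classpath. The list is loaded once on first use and then reused by every later caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper (not part of Tika): loads a provider list once and reuses it. */
final class CachedProviders<T> {
    private final Supplier<List<T>> loader; // stand-in for the expensive provider lookup
    private volatile List<T> cached;        // null until the first successful load

    CachedProviders(Supplier<List<T>> loader) {
        this.loader = loader;
    }

    /** Returns the cached list, invoking the loader only on the first call. */
    List<T> get() {
        List<T> result = cached;
        if (result == null) {
            synchronized (this) {
                result = cached;
                if (result == null) {
                    // Defensive copy, wrapped read-only so callers cannot alter shared state.
                    result = Collections.unmodifiableList(new ArrayList<>(loader.get()));
                    cached = result;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        // The strings are placeholders for EncodingDetector instances.
        CachedProviders<String> detectors = new CachedProviders<>(() -> {
            loads.incrementAndGet(); // counts how often the "classpath scan" runs
            return Arrays.asList("detectorA", "detectorB");
        });
        for (int i = 0; i < 3; i++) {
            detectors.get(); // simulates three separate parse() calls
        }
        System.out.println("loader invoked " + loads.get() + " time(s)");
    }
}
```

Held in a static field of a parser, such a cache corresponds to the "static property" option. It assumes the set of providers does not change after the first load, which may not hold for dynamically registered services or for custom ServiceLoader instances.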
      

            People

            • Assignee: Tim Allison (tallison@apache.org)
            • Reporter: Andrzej Bialecki (ab)
            • Votes: 3
            • Watchers: 6
