Uploaded image for project: 'Nutch'
  1. Nutch
  2. NUTCH-2949

Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Blocker
    • Resolution: Fixed
    • 1.19
    • 1.19
    • net, plugin, protocol
    • None

    Description

      While running a custom Nutch job (code here), many but not all task failed exceeding the the Hadoop task time-out (`mapreduce.task.timeout`) without generating any "heartbeat" (output, counter increments, log messages). Hadoop logs the stacks of all threads of the timed out task. That's the base for the excerpts below.

      The job runs a MultithreadedMapper - most of the mapper threads (48 in total) are waiting for the URLStreamHandler in order to construct a java.net.URL object:

      "Thread-11" #27 prio=5 os_prio=0 cpu=243.78ms elapsed=647.25s tid=0x00007f3eb5b0f800 nid=0x8e651 waiting for monitor entry  [0x00007f3e84ef9000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
              - waiting to lock <0x00000006a1bc0630> (a java.lang.String)
              at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
              at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
              at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
              at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
              at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
              at org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
              at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
              at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:274)
      

      Only a single mapper thread is active:

      "Thread-23" #39 prio=5 os_prio=0 cpu=5830.17ms elapsed=647.09s tid=0x00007f3eb5b42800 nid=0x8e661 in Object.wait()  [0x00007f3e842ec000]
         java.lang.Thread.State: RUNNABLE
              at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(java.base@11.0.15/Native Method)
              at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(java.base@11.0.15/NativeConstructorAccessorImpl.java:62)
              at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(java.base@11.0.15/DelegatingConstructorAccessorImpl.java:45)
              at java.lang.reflect.Constructor.newInstance(java.base@11.0.15/Constructor.java:490)
              at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:170)
              - locked <0x00000006a1bc0630> (a java.lang.String)
              at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
              at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
              at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
              at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
              at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
              at org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
              at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
              at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:274)
      

      The stack of one thread shows a nested lock:

      "Thread-26" #42 prio=5 os_prio=0 cpu=24.93ms elapsed=647.07s tid=0x00007f3eb5b49800 nid=0x8e664 waiting for monitor entry  [0x00007f3e83fe7000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
              - waiting to lock <0x00000006a1bc0630> (a java.lang.String)
              at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
              at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
              at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
              at javax.crypto.JceSecurity.<clinit>(java.base@11.0.15/JceSecurity.java:239)
              at javax.crypto.Cipher.getInstance(java.base@11.0.15/Cipher.java:540)
              at sun.security.ssl.JsseJce.getCipher(java.base@11.0.15/JsseJce.java:190)
              at sun.security.ssl.SSLCipher.isTransformationAvailable(java.base@11.0.15/SSLCipher.java:509)
              at sun.security.ssl.SSLCipher.<init>(java.base@11.0.15/SSLCipher.java:498)
              at sun.security.ssl.SSLCipher.<clinit>(java.base@11.0.15/SSLCipher.java:81)
              at sun.security.ssl.CipherSuite.<clinit>(java.base@11.0.15/CipherSuite.java:65)
              at sun.security.ssl.SSLContextImpl.getApplicableSupportedCipherSuites(java.base@11.0.15/SSLContextImpl.java:348)
              at sun.security.ssl.SSLContextImpl$AbstractTLSContext.<clinit>(java.base@11.0.15/SSLContextImpl.java:580)
              at java.lang.Class.forName0(java.base@11.0.15/Native Method)
              at java.lang.Class.forName(java.base@11.0.15/Class.java:315)
              at java.security.Provider$Service.getImplClass(java.base@11.0.15/Provider.java:1918)
              at java.security.Provider$Service.newInstance(java.base@11.0.15/Provider.java:1894)
              at sun.security.jca.GetInstance.getInstance(java.base@11.0.15/GetInstance.java:236)
              at sun.security.jca.GetInstance.getInstance(java.base@11.0.15/GetInstance.java:164)
              at javax.net.ssl.SSLContext.getInstance(java.base@11.0.15/SSLContext.java:168)
              at org.apache.nutch.protocol.okhttp.OkHttp.<clinit>(OkHttp.java:97)
              at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(java.base@11.0.15/Native Method)
              at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(java.base@11.0.15/NativeConstructorAccessorImpl.java:62)
              at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(java.base@11.0.15/DelegatingConstructorAccessorImpl.java:45)
              at java.lang.reflect.Constructor.newInstance(java.base@11.0.15/Constructor.java:490)
              at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:170)
              - locked <0x00000006a1bc0410> (a java.lang.String)
              at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
              at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
              at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
              at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
              at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
              at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
              at org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
              at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
              at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
              at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:274)
      

      It isn't a dead-lock because the lock is on two different String objects in different lines of code. However, a recursive barrier may make the performance impact of locks much worse. The stack also shows that Nutch's URLStreamHandlerFactory is also called from Java-internal methods. The cryptographic classes in the stack could also give a hint what's going wrong in NUTCH-2936.

      I'm not yet clear what the solution could be:

      • Extension.getExtensionInstance() with nested synchronization is for sure not suitable to be called frequently
      • maybe caching the URL stream handlers solves the problem
      • but we should keep also NUTCH-2936 on the radar when looking for solutions

      Attachments

        Issue Links

          Activity

            People

              snagel Sebastian Nagel
              snagel Sebastian Nagel
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: