Lucene - Core
  1. Lucene - Core
  2. LUCENE-6482

Class loading deadlock relating to Codec initialization, default codec and SPI discovery

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 4.9.1
    • Fix Version/s: 5.2.1, 5.3, 6.0
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This issue came up for us several times with Elasticsearch 1.3.4 (Lucene 4.9.1), with many threads seeming deadlocked but RUNNABLE:

      "elasticsearch[search77-es2][generic][T#43]" #160 daemon prio=5 os_prio=0 tid=0x00007f79180c5800 nid=0x3d1f in Object.wait() [0x00007f79d9289000]
         java.lang.Thread.State: RUNNABLE
      	at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:359)
      	at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:457)
      	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:912)
      	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:758)
      	at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:453)
      	at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:98)
      	at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:126)
      	at org.elasticsearch.index.store.Store.access$300(Store.java:76)
      	at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:465)
      	at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:456)
      	at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:281)
      	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:186)
      	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:140)
      	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61)
      	at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:277)
      	at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:268)
      	at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      

      It didn't really make sense to see RUNNABLE threads in Object.wait(), but this seems to be symptomatic of deadlocks in static initialization (http://ternarysearch.blogspot.ru/2013/07/static-initialization-deadlock.html).

      I found LUCENE-5573 as an instance of this having come up with Lucene code before.

      I'm not sure what exactly is going on, but the deadlock in this case seems to involve these threads:

      "elasticsearch[search77-es2][clusterService#updateTask][T#1]" #79 daemon prio=5 os_prio=0 tid=0x00007f7b155ff800 nid=0xd49 in Object.wait() [0x00007f79daed8000]
         java.lang.Thread.State: RUNNABLE
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
      	at java.lang.Class.newInstance(Class.java:433)
      	at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:67)
      	- locked <0x000000061fef4968> (a org.apache.lucene.util.NamedSPILoader)
      	at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:47)
      	at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:37)
      	at org.apache.lucene.codecs.PostingsFormat.<clinit>(PostingsFormat.java:44)
      	at org.elasticsearch.index.codec.postingsformat.PostingFormats.<clinit>(PostingFormats.java:67)
      	at org.elasticsearch.index.codec.CodecModule.configurePostingsFormats(CodecModule.java:126)
      	at org.elasticsearch.index.codec.CodecModule.configure(CodecModule.java:178)
      	at org.elasticsearch.common.inject.AbstractModule.configure(AbstractModule.java:60)
      	- locked <0x000000061fef49e8> (a org.elasticsearch.index.codec.CodecModule)
      	at org.elasticsearch.common.inject.spi.Elements$RecordingBinder.install(Elements.java:204)
      	at org.elasticsearch.common.inject.spi.Elements.getElements(Elements.java:85)
      	at org.elasticsearch.common.inject.InjectorShell$Builder.build(InjectorShell.java:130)
      	at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:99)
      	- locked <0x000000061fef4c10> (a org.elasticsearch.common.inject.InheritingState)
      	at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:131)
      	at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:69)
      	at org.elasticsearch.indices.InternalIndicesService.createIndex(InternalIndicesService.java:296)
      	- locked <0x000000061fef4cd0> (a org.elasticsearch.indices.InternalIndicesService)
      	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewIndices(IndicesClusterStateService.java:312)
      	at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:181)
      	- locked <0x000000061fef4e70> (a java.lang.Object)
      	at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:444)
      	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      
      "elasticsearch[search77-es2][generic][T#1]" #80 daemon prio=5 os_prio=0 tid=0x00007f794400a000 nid=0xd4b in Object.wait() [0x00007f79dac56000]
         java.lang.Thread.State: RUNNABLE
      	at org.apache.lucene.codecs.simpletext.SimpleTextCodec.<init>(SimpleTextCodec.java:37)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
      	at java.lang.Class.newInstance(Class.java:433)
      	at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:67)
      	- locked <0x000000061fcf1f50> (a org.apache.lucene.util.NamedSPILoader)
      	at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:47)
      	at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:37)
      	at org.apache.lucene.codecs.Codec.<clinit>(Codec.java:41)
      	at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:359)
      	at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:457)
      	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:912)
      	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:758)
      	at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:453)
      	at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:98)
      	at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:126)
      	at org.elasticsearch.index.store.Store.access$300(Store.java:76)
      	at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:465)
      	at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:456)
      	at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:281)
      	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:186)
      	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:140)
      	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61)
      	at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:277)
      	at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:268)
      	at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      

      Full thread dump: https://gist.github.com/shikhar/d0f6d2d008f45d2d4f91

      1. CodecLoadingDeadlockTest.java
        1 kB
        Shikhar Bhushan
      2. LUCENE-6482.patch
        15 kB
        Uwe Schindler
      3. LUCENE-6482.patch
        13 kB
        Uwe Schindler
      4. LUCENE-6482.patch
        8 kB
        Uwe Schindler
      5. LUCENE-6482.patch
        7 kB
        Uwe Schindler
      6. LUCENE-6482.patch
        6 kB
        Uwe Schindler
      7. LUCENE-6482-failingtest.patch
        6 kB
        Uwe Schindler
      8. LUCENE-6482-failingtest.patch
        4 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment - - edited

          Hi,
          the problem here has nothing to do with NamedSPILoader, just how Codecs are initialized. The problem can be one of the following:

          • there is one codec in the classpath that uses a wrong initialization like the issue you mentioned (LUCENE-5573). The problematic thing is in most cases a Codec/PostingsFormat/DocValuesFormat.forName() in a static initializer. We also have this in Lucene, but the order is important here. I don't like this, but code is not manageable otherwise. So order of static class initialization is important. If one of the codecs hangs in one of such clinit locks, the stack trace is easy. All those threads that seem to hang while RUNNING are blocked, because they access a class that is currently in initialization phase (as you said).
          • Elasticsearch has several own codecs, maybe those had a bug as described before. 1.3.4 is a older one, maybe update to latest 1.3.9 version. We have never seen this with plain Lucene.

          In addition, make sure that you use latest JVM versions. several Java 7 realeases had class loading bugs with deadlocks (e.g. 1.7.0_25). To me it looks more like one of those issues, because otherwise other people would have reported bugs like this already.

          What is your Java version? Any special JVM settings?

          Uwe

          Show
          Uwe Schindler added a comment - - edited Hi, the problem here has nothing to do with NamedSPILoader, just how Codecs are initialized. The problem can be one of the following: there is one codec in the classpath that uses a wrong initialization like the issue you mentioned ( LUCENE-5573 ). The problematic thing is in most cases a Codec/PostingsFormat/DocValuesFormat.forName() in a static initializer. We also have this in Lucene, but the order is important here. I don't like this, but code is not manageable otherwise. So order of static class initialization is important. If one of the codecs hangs in one of such clinit locks, the stack trace is easy. All those threads that seem to hang while RUNNING are blocked, because they access a class that is currently in initialization phase (as you said). Elasticsearch has several own codecs, maybe those had a bug as described before. 1.3.4 is a older one, maybe update to latest 1.3.9 version. We have never seen this with plain Lucene. In addition, make sure that you use latest JVM versions. several Java 7 realeases had class loading bugs with deadlocks (e.g. 1.7.0_25). To me it looks more like one of those issues, because otherwise other people would have reported bugs like this already. What is your Java version? Any special JVM settings? Uwe
          Hide
          Shikhar Bhushan added a comment - - edited

          Uwe Schindler This was seen on JDK8u5, but I think this has also happened on JDK8u25 (not certain though...).

          The issue is not deterministic and comes up during cluster bounces sometimes, so it's hard to say whether an ES upgrade fixes it.

          You're probably right that this has nothing to do with NamedSPILoader but the classes being loaded. Is it possible to conclude from the thread dump whether an ES or Lucene Codec/PostingsFormat/etc is involved?

          Show
          Shikhar Bhushan added a comment - - edited Uwe Schindler This was seen on JDK8u5, but I think this has also happened on JDK8u25 (not certain though...). The issue is not deterministic and comes up during cluster bounces sometimes, so it's hard to say whether an ES upgrade fixes it. You're probably right that this has nothing to do with NamedSPILoader but the classes being loaded. Is it possible to conclude from the thread dump whether an ES or Lucene Codec/PostingsFormat/etc is involved?
          Hide
          Uwe Schindler added a comment -

          The issue is not deterministic and comes up during cluster bounces sometimes, so it's hard to say whether an ES upgrade fixes it.

          I have no idea what you mean with "cluster bounce", but this should only happen during startup of Elasticsearch nodes: From the last 2 stack traces, you see that there are 2 things happening in parallel: Loading of Codec.class (because an Index was opened), but in parallel, Elasticsearch seems to initialize the PostingsFormats.class in the class CodecModule (Elasticsearch). In my opinion, this should not happen in parallel, but a fix would maybe that CodecModule should also call Codecs.forName() so those classes are initialized sequentially at one single place. The problem with Codec.class and PostingsFormat.class clinit running in parallel in different threads may have similar effects like you have seen in the blog post (Codecs depend on PostingsFormat and some PostingsFormats depend on the Codec class, which then hangs if PostingsFormats and Codecs are initialized from 2 different threads at same time, waiting for each other). But we have no chance to prevent this (unfortunately).

          I cannot say for sure, but something seems to be fishy while initializing Elasticsearch, because there is too much happening at the same time. In my opinion, Codecs and Postingsformats and Docvalues classes should be initialized sequentially, but I have no idea how to enforce this.

          Show
          Uwe Schindler added a comment - The issue is not deterministic and comes up during cluster bounces sometimes, so it's hard to say whether an ES upgrade fixes it. I have no idea what you mean with "cluster bounce", but this should only happen during startup of Elasticsearch nodes: From the last 2 stack traces, you see that there are 2 things happening in parallel: Loading of Codec.class (because an Index was opened), but in parallel, Elasticsearch seems to initialize the PostingsFormats.class in the class CodecModule (Elasticsearch). In my opinion, this should not happen in parallel, but a fix would maybe that CodecModule should also call Codecs.forName() so those classes are initialized sequentially at one single place. The problem with Codec.class and PostingsFormat.class clinit running in parallel in different threads may have similar effects like you have seen in the blog post (Codecs depend on PostingsFormat and some PostingsFormats depend on the Codec class, which then hangs if PostingsFormats and Codecs are initialized from 2 different threads at same time, waiting for each other). But we have no chance to prevent this (unfortunately). I cannot say for sure, but something seems to be fishy while initializing Elasticsearch, because there is too much happening at the same time. In my opinion, Codecs and Postingsformats and Docvalues classes should be initialized sequentially, but I have no idea how to enforce this.
          Hide
          Shikhar Bhushan added a comment -

          Thanks Uwe Schindler, makes sense and it does not seem like a Lucene issue, so I'll close this.

          It might have been due to using a custom Elasticsearch discovery plugin which is purely asynchronous that those 2 bits ended up happening in parallel, and caused the deadlock.

          Show
          Shikhar Bhushan added a comment - Thanks Uwe Schindler , makes sense and it does not seem like a Lucene issue, so I'll close this. It might have been due to using a custom Elasticsearch discovery plugin which is purely asynchronous that those 2 bits ended up happening in parallel, and caused the deadlock.
          Hide
          Shikhar Bhushan added a comment -
          Show
          Shikhar Bhushan added a comment - Reopening as per discussion in https://github.com/elastic/elasticsearch/issues/11170
          Hide
          Uwe Schindler added a comment -

          Thanks for reopening. At the moment I am not 100% sure if not using Codec.forName() really solves the problem... I have to write some test that emulates concurrent inits on Codec/PostingsFormat/... This is a all a bit tricky and without knowing what the JVM really does on class init, its hard to tell. The blog post you posted says the same: looks like a bug in JVM but is hard to reproduce and explain why this happens.

          I may have to dig first. In any case, for now I would ensure to initialize the Codec/PostingsFormat classes early on startup as workaround. I just want to fully understand whats happening.

          Show
          Uwe Schindler added a comment - Thanks for reopening. At the moment I am not 100% sure if not using Codec.forName() really solves the problem... I have to write some test that emulates concurrent inits on Codec/PostingsFormat/... This is a all a bit tricky and without knowing what the JVM really does on class init, its hard to tell. The blog post you posted says the same: looks like a bug in JVM but is hard to reproduce and explain why this happens. I may have to dig first. In any case, for now I would ensure to initialize the Codec/PostingsFormat classes early on startup as workaround. I just want to fully understand whats happening.
          Hide
          Shikhar Bhushan added a comment -

          Uwe Schindler I have had some luck reproducing the problem quite consistently with the attached test. If you uncomment the first line in the main() so that Codec is previously initialized before the threads start, the deadlock doesn't happen.

          Show
          Shikhar Bhushan added a comment - Uwe Schindler I have had some luck reproducing the problem quite consistently with the attached test. If you uncomment the first line in the main() so that Codec is previously initialized before the threads start, the deadlock doesn't happen.
          Hide
          Uwe Schindler added a comment -

          Many thanks. Maybe my test was too complicated (I used IndexReaders to reproduce your usecase). This is much easier. I think it might be even easier to reproduce if you use a CyclicBarrier for starting both threads at the exact same time.

          Once I reproduced I will check into fixing this. I already had the idea to synchronize the ServiceLoader component by Lucene against a single static lock instance. But it is good to have a reproduce case. So I can better check that a fix is effective.

          Show
          Uwe Schindler added a comment - Many thanks. Maybe my test was too complicated (I used IndexReaders to reproduce your usecase). This is much easier. I think it might be even easier to reproduce if you use a CyclicBarrier for starting both threads at the exact same time. Once I reproduced I will check into fixing this. I already had the idea to synchronize the ServiceLoader component by Lucene against a single static lock instance. But it is good to have a reproduce case. So I can better check that a fix is effective.
          Hide
          Shikhar Bhushan added a comment -

          Thanks Uwe. I have actually not had a single occasion of not encountering the deadlock, just these lines do the trick every time

            public static void main(String... args) {
              final Thread t1 = new Thread(() -> Codec.getDefault());
              final Thread t2 = new Thread(() -> new SimpleTextCodec());
          
              t1.start();
              t2.start();
            }
          

          I am using JDK8u25.

          Show
          Shikhar Bhushan added a comment - Thanks Uwe. I have actually not had a single occasion of not encountering the deadlock, just these lines do the trick every time public static void main(String... args) { final Thread t1 = new Thread(() -> Codec.getDefault()); final Thread t2 = new Thread(() -> new SimpleTextCodec()); t1.start(); t2.start(); } I am using JDK8u25.
          Hide
          Uwe Schindler added a comment -

          Thanks! I will also look into SimpleTextCodec, because in all your stack traces, this codec was affected. Did you also try with other codecs?

          Show
          Uwe Schindler added a comment - Thanks! I will also look into SimpleTextCodec, because in all your stack traces, this codec was affected. Did you also try with other codecs?
          Hide
          Uwe Schindler added a comment -

          SimpleTextCodec alone cannot be the bad guy... I'll still digging.

          Show
          Uwe Schindler added a comment - SimpleTextCodec alone cannot be the bad guy... I'll still digging.
          Hide
          Uwe Schindler added a comment -

          I can easily reproduce this!

          I will now try to "synchronize" the serviceloader class, so we make sure that the classpath scanning is done sequentially.

          Show
          Uwe Schindler added a comment - I can easily reproduce this! I will now try to "synchronize" the serviceloader class, so we make sure that the classpath scanning is done sequentially.
          Hide
          Uwe Schindler added a comment - - edited

          I fails with every codec. The issue happens only if you call Codec.forName() at the same time as using the constructor of any Codec subclass.

          I have no idea how we should prevent that. I tried to synchronize NamedSPILoader, but that did not help. The problem is that this is a special type of deadlock that does not really involve "standard Java locks". It is more the JVM internal prevent to initialize a subclass before the parent class is initialized.

          Show
          Uwe Schindler added a comment - - edited I fails with every codec. The issue happens only if you call Codec.forName() at the same time as using the constructor of any Codec subclass. I have no idea how we should prevent that. I tried to synchronize NamedSPILoader, but that did not help. The problem is that this is a special type of deadlock that does not really involve "standard Java locks". It is more the JVM internal prevent to initialize a subclass before the parent class is initialized.
          Hide
          Uwe Schindler added a comment - - edited

          From digging around, the main problem is basically the following:

          • The JVM requires that a parent class must be initialized before the child class
          • But there is a "special case": A parent class is allowed to initialize subclasses from its static initializers (so stuff like our setup works). This is documented in the JLS.
          • But another thread is not allowed to initialize subclasses at the same time.

          This basically leads to the deadlock we are seeing here. We have a not yet fully initialized Codec parent class. The other thread is creating an instance of a subclass directly. But while initializing this subclass, it waits for the parent class to get available. But the parent class is currently scanning classpath and creating instances of all available codecs. While doing this it tries to create an instance of exactly the same class that the other thread is instantiating directly using new(). And this is the deadlock.

          Show
          Uwe Schindler added a comment - - edited From digging around, the main problem is basically the following: The JVM requires that a parent class must be initialized before the child class But there is a "special case": A parent class is allowed to initialize subclasses from its static initializers (so stuff like our setup works). This is documented in the JLS. But another thread is not allowed to initialize subclasses at the same time. This basically leads to the deadlock we are seeing here. We have a not yet fully initialized Codec parent class. The other thread is creating an instance of a subclass directly. But while initializing this subclass, it waits for the parent class to get available. But the parent class is currently scanning classpath and creating instances of all available codecs. While doing this it tries to create an instance of exactly the same class that the other thread is instantiating directly using new(). And this is the deadlock.
          Hide
          Uwe Schindler added a comment -

          The only workaround I see is the following:

          • Move the ServiceLoader, forName, and classpath scanning out of <clinit> of Codec (same for PostingsFormat and DocValuesFormat) into separate pkg-private classes (or static inner classes). Let Codec.forName() delegate there. In addition dont call Codec.forName() inside clinit of codec, so initialize the default codec via new.

          I will try to implement this, unfortunately the code is no longer as nice as now. I digged around: IBM's ICU has similar hacks to prevent this type of stuff while loading Locales or Charsets.

          Show
          Uwe Schindler added a comment - The only workaround I see is the following: Move the ServiceLoader, forName, and classpath scanning out of <clinit> of Codec (same for PostingsFormat and DocValuesFormat) into separate pkg-private classes (or static inner classes). Let Codec.forName() delegate there. In addition dont call Codec.forName() inside clinit of codec, so initialize the default codec via new . I will try to implement this, unfortunately the code is no longer as nice as now. I digged around: IBM's ICU has similar hacks to prevent this type of stuff while loading Locales or Charsets.
          Hide
          Uwe Schindler added a comment -

          Attached ypu will find a patch that solves the issue.

          I may only need to check SmokeTester that it can "identify" the new pattern. With this patch the deadlock is prevented, because a separate, hidden Helper class is used that holds 2 things: The Serviceloader and the default Codec. Initialization is delayed until is accessed first, so a deadlock can never happen.

          I had to remove the checks that code may call forName() from subclasses ctor (which now works), but may add the deadlock again. So I'll find a way to detect this (using asserts on stacktrace or whatever).

          Shikhar Bhushan: Can you apply the patch and try to check on your side?

          Show
          Uwe Schindler added a comment - Attached ypu will find a patch that solves the issue. I may only need to check SmokeTester that it can "identify" the new pattern. With this patch the deadlock is prevented, because a separate, hidden Helper class is used that holds 2 things: The Serviceloader and the default Codec. Initialization is delayed until is accessed first, so a deadlock can never happen. I had to remove the checks that code may call forName() from subclasses ctor (which now works), but may add the deadlock again. So I'll find a way to detect this (using asserts on stacktrace or whatever). Shikhar Bhushan : Can you apply the patch and try to check on your side?
          Hide
          Uwe Schindler added a comment -

          I checked SPI on the Tokenstream factories: There is no such issue:

          • we don't initialize the classes
          • we don't initialize instances until requested
          • we have no "default" tokenstreams or similar
          Show
          Uwe Schindler added a comment - I checked SPI on the Tokenstream factories: There is no such issue: we don't initialize the classes we don't initialize instances until requested we have no "default" tokenstreams or similar
          Hide
          Uwe Schindler added a comment - - edited

          New patch. I renamed the inner class to "Holder" and added Javadocs.

          The checks I removed are no longer an issue. This patch also prevents the deadlock that might happen if you call Codec.forName() from the constructor of a subclass. I would still not do this, but there is no more reason to check this - it cannot deadlock or NPE.

          Unfrotunately this is too late to get into 5.2, so I delay to 5.3 (or maybe 5.2.1). We should maybe put this into a 4.10.x release for those people that are affected by this and are still on 4.x. I will raise this issue's priority to "Critical".

          Elasticsearch should maybe for now use the "workaround" by calling Codec.availableCodecs() in its Boostrap class before init.

          Show
          Uwe Schindler added a comment - - edited New patch. I renamed the inner class to "Holder" and added Javadocs. The checks I removed are no longer an issue. This patch also prevents the deadlock that might happen if you call Codec.forName() from the constructor of a subclass. I would still not do this, but there is no more reason to check this - it cannot deadlock or NPE. Unfrotunately this is too late to get into 5.2, so I delay to 5.3 (or maybe 5.2.1). We should maybe put this into a 4.10.x release for those people that are affected by this and are still on 4.x. I will raise this issue's priority to "Critical". Elasticsearch should maybe for now use the "workaround" by calling Codec.availableCodecs() in its Boostrap class before init.
          Hide
          Uwe Schindler added a comment -

          I checked smoke tester. This one does not validate the Codec.forName() in Codec.java, so no changes needed.

          I think patch is ready. Many thanks to Shikhar Bhushan for reporting this, this is really a nasty bug

          Show
          Uwe Schindler added a comment - I checked smoke tester. This one does not validate the Codec.forName() in Codec.java, so no changes needed. I think patch is ready. Many thanks to Shikhar Bhushan for reporting this, this is really a nasty bug
          Hide
          Uwe Schindler added a comment - - edited

          This also fixes the bug that was investigated during LUCENE-4440: A filtercodec using forName() to create the delegate. This is no longer leading to issues (see above).

          Ignore this, the issue is still there on FilterCodec. But we have the warning in the Javadocs of FilterCodec already. I will improve the error message again...

          Show
          Uwe Schindler added a comment - - edited This also fixes the bug that was investigated during LUCENE-4440 : A filtercodec using forName() to create the delegate. This is no longer leading to issues (see above). Ignore this, the issue is still there on FilterCodec. But we have the warning in the Javadocs of FilterCodec already. I will improve the error message again...
          Hide
          Uwe Schindler added a comment -

          Revised patch which brings back the IllegalStateExceptions for people misusing FilterCodec.

          I think it's ready.

          Show
          Uwe Schindler added a comment - Revised patch which brings back the IllegalStateExceptions for people misusing FilterCodec. I think it's ready.
          Hide
          Uwe Schindler added a comment -

          New patch with a testcase that fails if the rest of patch is not applied (just apply LUCENE-6482-failingtest.patch to a plain checkout. This test spawns another JVM and spawns some threads to load codecs in parallel.

          It is similar to Shikhar Bhushan's test, but allows to run inside the "normal testsuite".

          Show
          Uwe Schindler added a comment - New patch with a testcase that fails if the rest of patch is not applied (just apply LUCENE-6482 -failingtest.patch to a plain checkout. This test spawns another JVM and spawns some threads to load codecs in parallel. It is similar to Shikhar Bhushan 's test, but allows to run inside the "normal testsuite".
          Hide
          Uwe Schindler added a comment -

          Final patch: I improved the test to be more random.

          I will now commit the failing test as a first try (to trunk), because I want to check if Jenkins fails Afterwards I will commit the fix.

          Show
          Uwe Schindler added a comment - Final patch: I improved the test to be more random. I will now commit the failing test as a first try (to trunk), because I want to check if Jenkins fails Afterwards I will commit the fix.
          Hide
          ASF subversion and git services added a comment -

          Commit 1684005 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1684005 ]

          LUCENE-6482: Commit failing test first and wait for Jenkins failure (to prove that it works also on Unix and Linux)

          Show
          ASF subversion and git services added a comment - Commit 1684005 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1684005 ] LUCENE-6482 : Commit failing test first and wait for Jenkins failure (to prove that it works also on Unix and Linux)
          Hide
          ASF subversion and git services added a comment -

          Commit 1684006 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1684006 ]

          LUCENE-6482: Fix class loading deadlock relating to Codec initialization, default codec and SPI discovery

          Show
          ASF subversion and git services added a comment - Commit 1684006 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1684006 ] LUCENE-6482 : Fix class loading deadlock relating to Codec initialization, default codec and SPI discovery
          Hide
          ASF subversion and git services added a comment -

          Commit 1684007 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1684007 ]

          Merged revision(s) 1684005-1684006 from lucene/dev/trunk:
          LUCENE-6482: Commit failing test first and wait for Jenkins failure (to prove that it works also on Unix and Linux)
          LUCENE-6482: Fix class loading deadlock relating to Codec initialization, default codec and SPI discovery
          (+ Java 7 change: move away from Lambdas and streams API)

          Show
          ASF subversion and git services added a comment - Commit 1684007 from Uwe Schindler in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1684007 ] Merged revision(s) 1684005-1684006 from lucene/dev/trunk: LUCENE-6482 : Commit failing test first and wait for Jenkins failure (to prove that it works also on Unix and Linux) LUCENE-6482 : Fix class loading deadlock relating to Codec initialization, default codec and SPI discovery (+ Java 7 change: move away from Lambdas and streams API)
          Hide
          Uwe Schindler added a comment -

          I committed the changes and backported to 5.x (without Lambdas).

          Should we backport to 4.10.x and 5.2.x branch, too? It would be good to hear from the Elasticsearch people if this is required or if they go with the "simple fix" by calling Codecs.availableCodecs() in Bootstrap.java without concurrency?

          Show
          Uwe Schindler added a comment - I committed the changes and backported to 5.x (without Lambdas). Should we backport to 4.10.x and 5.2.x branch, too? It would be good to hear from the Elasticsearch people if this is required or if they go with the "simple fix" by calling Codecs.availableCodecs() in Bootstrap.java without concurrency?
          Hide
          ASF subversion and git services added a comment -

          Commit 1684008 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1684008 ]

          LUCENE-6482: Fix Java 7, which has very limited Process API (I did not notice before, as I had Java 8 as default runtime)

          Show
          ASF subversion and git services added a comment - Commit 1684008 from Uwe Schindler in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1684008 ] LUCENE-6482 : Fix Java 7, which has very limited Process API (I did not notice before, as I had Java 8 as default runtime)
          Hide
          ASF subversion and git services added a comment -

          Commit 1684010 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1684010 ]

          LUCENE-6482: Improve Java 7 fix

          Show
          ASF subversion and git services added a comment - Commit 1684010 from Uwe Schindler in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1684010 ] LUCENE-6482 : Improve Java 7 fix
          Hide
          ASF subversion and git services added a comment -

          Commit 1684024 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1684024 ]

          LUCENE-6482: Fix detection of default codec for documentation homepage

          Show
          ASF subversion and git services added a comment - Commit 1684024 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1684024 ] LUCENE-6482 : Fix detection of default codec for documentation homepage
          Hide
          ASF subversion and git services added a comment -

          Commit 1684025 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1684025 ]

          Merged revision(s) 1684024 from lucene/dev/trunk:
          LUCENE-6482: Fix detection of default codec for documentation homepage

          Show
          ASF subversion and git services added a comment - Commit 1684025 from Uwe Schindler in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1684025 ] Merged revision(s) 1684024 from lucene/dev/trunk: LUCENE-6482 : Fix detection of default codec for documentation homepage
          Hide
          Uwe Schindler added a comment -

          I will backport this to 5.2.1, because we will have a bugfix release: Regression in Solr

          Show
          Uwe Schindler added a comment - I will backport this to 5.2.1, because we will have a bugfix release: Regression in Solr
          Hide
          ASF subversion and git services added a comment -

          Commit 1684242 from Uwe Schindler in branch 'dev/branches/lucene_solr_5_2'
          [ https://svn.apache.org/r1684242 ]

          Merged revision(s) 1684007-1684008, 1684010, 1684025 from lucene/dev/branches/branch_5x:
          LUCENE-6482: Commit failing test first and wait for Jenkins failure (to prove that it works also on Unix and Linux)
          LUCENE-6482: Fix class loading deadlock relating to Codec initialization, default codec and SPI discovery
          (+ Java 7 change: move away from Lambdas and streams API)
          LUCENE-6482: Fix Java 7, which has very limited Process API (I did not notice before, as I had Java 8 as default runtime)
          LUCENE-6482: Improve Java 7 fix
          LUCENE-6482: Fix detection of default codec for documentation homepage

          Show
          ASF subversion and git services added a comment - Commit 1684242 from Uwe Schindler in branch 'dev/branches/lucene_solr_5_2' [ https://svn.apache.org/r1684242 ] Merged revision(s) 1684007-1684008, 1684010, 1684025 from lucene/dev/branches/branch_5x: LUCENE-6482 : Commit failing test first and wait for Jenkins failure (to prove that it works also on Unix and Linux) LUCENE-6482 : Fix class loading deadlock relating to Codec initialization, default codec and SPI discovery (+ Java 7 change: move away from Lambdas and streams API) LUCENE-6482 : Fix Java 7, which has very limited Process API (I did not notice before, as I had Java 8 as default runtime) LUCENE-6482 : Improve Java 7 fix LUCENE-6482 : Fix detection of default codec for documentation homepage
          Hide
          Shikhar Bhushan added a comment -

          Thanks for fixing this Uwe Schindler! Great digging on what was going on. The fix and the test looks good to me.

          Show
          Shikhar Bhushan added a comment - Thanks for fixing this Uwe Schindler ! Great digging on what was going on. The fix and the test looks good to me.

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Shikhar Bhushan
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development