Solr
  1. Solr
  2. SOLR-3424

PhoneticFilterFactory threadsafety bug

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      PhoneticFilterFactory has a static HashMap registry mapping an encoder name to an implementation. There is a ReentrantLock used when the map is modified (when the encoder config specifies a class name). However, this map, which can be accessed by multiple indexing threads, isn't guarded on any of the reads, which isn't just the common path but also the error messages which dump the registry into the error message.

      I realize the likelihood of a problem is extremely slim, but a bug's a bug.

        Activity

        Hide
        David Smiley added a comment -

        Committed to trunk in 1334544.

        I left the commons-coded javadoc url as-is since it was finally updated to the latest version.

        Show
        David Smiley added a comment - Committed to trunk in 1334544. I left the commons-coded javadoc url as-is since it was finally updated to the latest version.
        Hide
        Uwe Schindler added a comment -

        Hi David,
        just commit it, I am about to leave for business trips (including Lucene Rev) so have not much time.
        Uwe

        Show
        Uwe Schindler added a comment - Hi David, just commit it, I am about to leave for business trips (including Lucene Rev) so have not much time. Uwe
        Hide
        David Smiley added a comment -

        Looks good Uwe.
        One minor nit is that the URL in the class Javadoc to Commons-Codec's Language package should be this URL:
        http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/package-summary.html which is the one to 1.6; the existing link is older with fewer classes. We've got this link in both the FilterFactory & Filter.

        Feel free to commit if you want or just leave it to me.

        Show
        David Smiley added a comment - Looks good Uwe. One minor nit is that the URL in the class Javadoc to Commons-Codec's Language package should be this URL: http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/package-summary.html which is the one to 1.6; the existing link is older with fewer classes. We've got this link in both the FilterFactory & Filter. Feel free to commit if you want or just leave it to me.
        Hide
        Uwe Schindler added a comment - - edited

        Hi David,
        I don't really trust that thread so I would really go with the current solution. In Lucene 4.0, reuse of TokenStreams is enforced so there is no cost at all by the reflection and the createInstance. I also improved the reflective method call (which has the side effect of being able to print a useful message when the encoder does not support setMaxLen, the old code throwed unspecified and unhelpful Exception).

        In general, the commons encoders might be thread safe, but we use reflection and use any class the user provides and other classes implementing the abstract Encoder interface may not be thread safe (unless this is explicitely specified in their spec / javadocs - which isn't)

        As it costs us nothing (Class.newInstance() is cheap and only called once a new thread is used and the SolrAnalyzer class has to create new TokenStreamComponents), we should stay with the current code. The thread-safety is then ensured by the Analyzer class (it uses thread-locals to reuse TokenStreams).

        What do you think about my latest patch?

        Show
        Uwe Schindler added a comment - - edited Hi David, I don't really trust that thread so I would really go with the current solution. In Lucene 4.0, reuse of TokenStreams is enforced so there is no cost at all by the reflection and the createInstance. I also improved the reflective method call (which has the side effect of being able to print a useful message when the encoder does not support setMaxLen, the old code throwed unspecified and unhelpful Exception). In general, the commons encoders might be thread safe, but we use reflection and use any class the user provides and other classes implementing the abstract Encoder interface may not be thread safe (unless this is explicitely specified in their spec / javadocs - which isn't) As it costs us nothing (Class.newInstance() is cheap and only called once a new thread is used and the SolrAnalyzer class has to create new TokenStreamComponents), we should stay with the current code. The thread-safety is then ensured by the Analyzer class (it uses thread-locals to reuse TokenStreams). What do you think about my latest patch?
        Hide
        David Smiley added a comment -

        So it appears the Commons-Codec Encoders are all thread-safe after all:
        http://apache-commons.680414.n4.nabble.com/Thread-safety-of-Commons-Codec-Encoder-tt4600956.html

        Show
        David Smiley added a comment - So it appears the Commons-Codec Encoders are all thread-safe after all: http://apache-commons.680414.n4.nabble.com/Thread-safety-of-Commons-Codec-Encoder-tt4600956.html
        Hide
        Uwe Schindler added a comment -

        Hi, I looked again at the code and found the following improvement, attached as new patch:

        The reflection based-method lookup should be done in the initialization. The pointer to the method is then saved in a instance field, so the getEncoder() method only needs to invoke. This removes more than half of the reflection cost.
        The pointer to the method is null, if no maxLength is set.
        I also did some minor cleanups in Exception handling.

        Show
        Uwe Schindler added a comment - Hi, I looked again at the code and found the following improvement, attached as new patch: The reflection based-method lookup should be done in the initialization. The pointer to the method is then saved in a instance field, so the getEncoder() method only needs to invoke. This removes more than half of the reflection cost. The pointer to the method is null, if no maxLength is set. I also did some minor cleanups in Exception handling.
        Hide
        Uwe Schindler added a comment -

        There seems to be no Analyzer in Lucene that uses this class. The phonetic package only contains the plain TokenFilters and those are single-thread only.

        Show
        Uwe Schindler added a comment - There seems to be no Analyzer in Lucene that uses this class. The phonetic package only contains the plain TokenFilters and those are single-thread only.
        Hide
        Uwe Schindler added a comment -

        +1

        I wanted to review the stock Lucene Analyzer about thread safety, too.

        Show
        Uwe Schindler added a comment - +1 I wanted to review the stock Lucene Analyzer about thread safety, too.
        Hide
        David Smiley added a comment -

        I'll commit it in ~24 hours.

        Show
        David Smiley added a comment - I'll commit it in ~24 hours.
        Hide
        David Smiley added a comment -

        My new patch creates & initializes the encoder per call to create(). The testSpeed() tests seem about the same. I also added a test for the maxCodeLen arg, and i documented it too.

        Show
        David Smiley added a comment - My new patch creates & initializes the encoder per call to create(). The testSpeed() tests seem about the same. I also added a test for the maxCodeLen arg, and i documented it too.
        Hide
        David Smiley added a comment -

        Uwe: Excellent catch RE my "static" bug, and I agree on your improvement RE clazz & null.

        I tried to ascertain the thread-safety status of Commons-Codec and unfortunately I don't think we can assume it's thread-safe. I sent an email to their list about this just now: http://apache-commons.680414.n4.nabble.com/Thread-safety-of-Commons-Codec-Encoder-tt4600956.html

        So I agree that we'll have to insatiate one of these in our create() method instead of caching it.

        Show
        David Smiley added a comment - Uwe: Excellent catch RE my "static" bug, and I agree on your improvement RE clazz & null. I tried to ascertain the thread-safety status of Commons-Codec and unfortunately I don't think we can assume it's thread-safe. I sent an email to their list about this just now: http://apache-commons.680414.n4.nabble.com/Thread-safety-of-Commons-Codec-Encoder-tt4600956.html So I agree that we'll have to insatiate one of these in our create() method instead of caching it.
        Hide
        Uwe Schindler added a comment -

        Updated patch, I added a testcase for the reflection without package name:

        • I use "Caverphone2" as package less name, which is not in the REGISTRY. The returned class should be Caverphone2
        • I cross-check with the REGISTRY name "Caverphone", which should also return the new Caverphone2 encoder, but without reflection
        Show
        Uwe Schindler added a comment - Updated patch, I added a testcase for the reflection without package name: I use "Caverphone2" as package less name, which is not in the REGISTRY. The returned class should be Caverphone2 I cross-check with the REGISTRY name "Caverphone", which should also return the new Caverphone2 encoder, but without reflection
        Hide
        Uwe Schindler added a comment -

        Attached is a patch that fixes the initialization bug and improves the reflection code.

        One thing that came into my mind when I looked at the code of PhoneticFilter: I did not find out if Encoders are threadsafe or not. If they aren't or we are not sure (the documentation states nothing), we should create the Encoder class instance in the TokenStream create() method. TokenStreams itsself are only used per thread (iterator pattern), and the Analyzer reuse ensures that we have a separate instance for each thread.

        Show
        Uwe Schindler added a comment - Attached is a patch that fixes the initialization bug and improves the reflection code. One thing that came into my mind when I looked at the code of PhoneticFilter: I did not find out if Encoders are threadsafe or not. If they aren't or we are not sure (the documentation states nothing), we should create the Encoder class instance in the TokenStream create() method. TokenStreams itsself are only used per thread (iterator pattern), and the Analyzer reuse ensures that we have a separate instance for each thread.
        Hide
        Uwe Schindler added a comment -

        Hi David,

        looks good, but there is a bug:
        The refactoring of the REGISTRY static map was an anonymous inner class before (new HashMap() {{ hashmap ctor code }}; (please note double parens). Now it is assigned to a static field, but the initializer code is an inline ctor of the factory, means the initialization runs on every instantiation.

        • add static before the anonymous ctor:
          private static final Map<String, Class<? extends Encoder>> registry = new HashMap<String, Class<? extends Encoder>>(6);
          static {
           registry.put(...
          
        • or leave the initializer as it was before (anonymous HashMap subclass with overridden ctor).

        in resolve encoder maybe remove the clazz variable and directly return the result of forName(). I do't like non-final variables initialized with null because that prevents compilation failures on missing initialization and null could be returned - especially with lots of try-catch blocks.

        Show
        Uwe Schindler added a comment - Hi David, looks good, but there is a bug: The refactoring of the REGISTRY static map was an anonymous inner class before (new HashMap() {{ hashmap ctor code }}; (please note double parens). Now it is assigned to a static field, but the initializer code is an inline ctor of the factory, means the initialization runs on every instantiation. add static before the anonymous ctor: private static final Map< String , Class <? extends Encoder>> registry = new HashMap< String , Class <? extends Encoder>>(6); static { registry.put(... or leave the initializer as it was before (anonymous HashMap subclass with overridden ctor). in resolve encoder maybe remove the clazz variable and directly return the result of forName(). I do't like non-final variables initialized with null because that prevents compilation failures on missing initialization and null could be returned - especially with lots of try-catch blocks.
        Hide
        David Smiley added a comment -

        Here is an updated patch. I enhanced the javadocs too, which have become out of date.

        Show
        David Smiley added a comment - Here is an updated patch. I enhanced the javadocs too, which have become out of date.
        Hide
        Uwe Schindler added a comment -

        Thanks! Thaks for taking care, as I have no time to provide a patch at the moment

        but I will since it seems to be a big deal to you (three exclamation points)

        It's not so important, if it's private to the class! I just don't want public code to expose Collections not intended to be modified in a modifiable way, so i am fine with or without unmodifiable. There are also places in Lucene violating this (or use public arrays, which cannot be protected - like CompositeReader.getSequentialSubReaders()).

        Show
        Uwe Schindler added a comment - Thanks! Thaks for taking care, as I have no time to provide a patch at the moment but I will since it seems to be a big deal to you (three exclamation points) It's not so important, if it's private to the class! I just don't want public code to expose Collections not intended to be modified in a modifiable way, so i am fine with or without unmodifiable. There are also places in Lucene violating this (or use public arrays, which cannot be protected - like CompositeReader.getSequentialSubReaders()).
        Hide
        David Smiley added a comment -

        Rob: This is a (minor) thread-safety bug and consequently it's not really feasible to test it without a lot of work and without a guarantee at the end that there is no problem, and so it isn't worth it. Of course if you want to; go for it

        Uwe: Thanks very much for your comments. I wasn't sure what the lifecycle of this class was or if/how it is shared; I looked at the javadocs of TokenFilterFactory just now and I think its clearer. Given that the Factory's init() method is not going to be called frequently, it seems to me that the class name based lookups need not cache at all. And I also like your suggestion of improving the fallback lookup by looking for a '.'. BTW I considered wrapping with unmodifiableMap but didn't bother because it's not exposed outside of this class, which is fairly small too, but I will since it seems to be a big deal to you (three exclamation points). I'll propose another patch later

        ~ David

        Show
        David Smiley added a comment - Rob: This is a (minor) thread-safety bug and consequently it's not really feasible to test it without a lot of work and without a guarantee at the end that there is no problem, and so it isn't worth it. Of course if you want to; go for it Uwe: Thanks very much for your comments. I wasn't sure what the lifecycle of this class was or if/how it is shared; I looked at the javadocs of TokenFilterFactory just now and I think its clearer. Given that the Factory's init() method is not going to be called frequently, it seems to me that the class name based lookups need not cache at all. And I also like your suggestion of improving the fallback lookup by looking for a '.'. BTW I considered wrapping with unmodifiableMap but didn't bother because it's not exposed outside of this class, which is fairly small too, but I will since it seems to be a big deal to you (three exclamation points). I'll propose another patch later ~ David
        Hide
        Uwe Schindler added a comment -

        One thing to add:

        the cache only caches "resolves" not instantiations of encoders, the JVM caches Class.forName() lookups, so why cache it again?

        You may argue, that the lookup may be more expensive, as it uses a try..catch block (first try in expected commons package, later try a full class name). I think the chain of Try...Catch with ClassNoFound should be changed to do it more smart: If the name of encoder contains a period (indexOf('.')>=0), look it up as class name, otherwise prefix it with the commons package name. This way, the JVM cache for loaded classes can be used and the cache is completely useless.

        I like in your fix, that it also changed the broken uppercasing: It should only do that for the builtins, class names itsself are case sensitive.

        +1 to remove the cache and only keep the static builtins (and please: as Collections.unmodifiableMap!!!), lookup non-builtins without try...catch(ClassNotFound). The error message should only list the builtins and mention that otherwise the name should be a full class name.

        Show
        Uwe Schindler added a comment - One thing to add: the cache only caches "resolves" not instantiations of encoders, the JVM caches Class.forName() lookups, so why cache it again? You may argue, that the lookup may be more expensive, as it uses a try..catch block (first try in expected commons package, later try a full class name). I think the chain of Try...Catch with ClassNoFound should be changed to do it more smart: If the name of encoder contains a period (indexOf('.')>=0), look it up as class name, otherwise prefix it with the commons package name. This way, the JVM cache for loaded classes can be used and the cache is completely useless. I like in your fix, that it also changed the broken uppercasing: It should only do that for the builtins, class names itsself are case sensitive. +1 to remove the cache and only keep the static builtins (and please: as Collections.unmodifiableMap!!!), lookup non-builtins without try...catch(ClassNotFound). The error message should only list the builtins and mention that otherwise the name should be a full class name.
        Hide
        Uwe Schindler added a comment -

        The whole design of this cache is wrong:

        • it should not be static, the resolved class lookups should only be cached per instance
        • the cache only caches "resolves" not instantiations of encoders, the JVM caches Class.forName() lookups, so why cache it again?
        • To fix the orginal issue, a ConcurrentHashMap would also be fine
        • Solr/Lucene use Analyzers which are now enforced to reuse tokenstreams, so the lookup is only done once when IndexWriter starts and a new indexing thread is created

        The last point tells me: "remove this cache"

        Show
        Uwe Schindler added a comment - The whole design of this cache is wrong: it should not be static, the resolved class lookups should only be cached per instance the cache only caches "resolves" not instantiations of encoders, the JVM caches Class.forName() lookups, so why cache it again? To fix the orginal issue, a ConcurrentHashMap would also be fine Solr/Lucene use Analyzers which are now enforced to reuse tokenstreams, so the lookup is only done once when IndexWriter starts and a new indexing thread is created The last point tells me: "remove this cache"
        Hide
        Robert Muir added a comment -

        where is the test for the bug?

        why does this factory have a cache at all?!

        Show
        Robert Muir added a comment - where is the test for the bug? why does this factory have a cache at all?!
        Hide
        David Smiley added a comment -

        And I also noticed that Commons-Codec's Caverphone.class was deprecated for Caverphone2.class so I made that simple change. The docs say the deprecated one simply forwards calls, so there should be no side-effects of this change.

        Show
        David Smiley added a comment - And I also noticed that Commons-Codec's Caverphone.class was deprecated for Caverphone2.class so I made that simple change. The docs say the deprecated one simply forwards calls, so there should be no side-effects of this change.
        Hide
        David Smiley added a comment -

        The attached patch fixes the problem by basically "upgrading" the code to use Guava's MapMaker which is excellent for caches. I keep 2 distinct registries, the constant builtin one, and the custom class one, because there is a distinction between the two which is the capitalization of the keys. I couldn't have just one MapMaker map because the key lookup of the computed value should be the original class name uncapitalized.

        Show
        David Smiley added a comment - The attached patch fixes the problem by basically "upgrading" the code to use Guava's MapMaker which is excellent for caches. I keep 2 distinct registries, the constant builtin one, and the custom class one, because there is a distinction between the two which is the capitalization of the keys. I couldn't have just one MapMaker map because the key lookup of the computed value should be the original class name uncapitalized.

          People

          • Assignee:
            David Smiley
            Reporter:
            David Smiley
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development