Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The ICU Collators (unlike the JDK ones) aren't thread safe: http://userguide.icu-project.org/collation/architecture , a little non-obvious since its not mentioned
      in the javadocs, and its not clear if the docs apply to only the C code, but i looked
      at the source and there is all kinds of internal state.

      So in my opinion, we should clone the icu collators (which are passed in from the outside)
      when creating a new TokenStream/AttributeImpl to prevent problems. This shouldn't be a big
      deal since everything uses reusableTokenStream anyway.

      1. LUCENE-2943.patch
        9 kB
        Robert Muir
      2. LUCENE-2943.patch
        3 kB
        Robert Muir
      3. LUCENE-2943.patch
        2 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        Here's a patch for trunk (including fix to the deprecated 3.x filter).

        We should fix this bug in 3.x too.

        Show
        Robert Muir added a comment - Here's a patch for trunk (including fix to the deprecated 3.x filter). We should fix this bug in 3.x too.
        Hide
        Uwe Schindler added a comment -

        Patch looks fine!

        Just one question: is the clone() method itsself thread-safe? if not, it must be synchronized somehow.

        Show
        Uwe Schindler added a comment - Patch looks fine! Just one question: is the clone() method itsself thread-safe? if not, it must be synchronized somehow.
        Hide
        Uwe Schindler added a comment - - edited

        I changed my mind a little bit:

        The cloning of the Collator should be done in the Analyzer not in the Filter. The same applies to the AttributeImpl, the cloning should not be done in the ctor. The problem is not that the TokenStream or the Attribute instance may reuse the attribute in different threads, the problem is that the factory class (the Analyzer) does reuse the Collator in different threads when it produces multiple tokenstreams or the AF multiple attributes.

        This is a slight difference, because the following code is always safe:
        new CollationFilter(Collator.newInstance(lang)), cloning would be wrong.

        The reason for the whole thing: TokenStream and Attribute instances itsself are single-threaded only, but not the factory or the analyzer.

        Show
        Uwe Schindler added a comment - - edited I changed my mind a little bit: The cloning of the Collator should be done in the Analyzer not in the Filter. The same applies to the AttributeImpl, the cloning should not be done in the ctor. The problem is not that the TokenStream or the Attribute instance may reuse the attribute in different threads, the problem is that the factory class (the Analyzer) does reuse the Collator in different threads when it produces multiple tokenstreams or the AF multiple attributes. This is a slight difference, because the following code is always safe: new CollationFilter(Collator.newInstance(lang)), cloning would be wrong. The reason for the whole thing: TokenStream and Attribute instances itsself are single-threaded only, but not the factory or the analyzer.
        Hide
        Robert Muir added a comment -

        Uwe, that is one alternative.

        The only reason i did it this way, is because I felt it was a bit of a trap (to any users using the Filter directly). This is because JDK collators are in fact thread safe.

        This is a slight difference, because the following code is always safe:
        new CollationFilter(Collator.newInstance(lang)), cloning would be wrong.

        I don't think this is really a reasonable example, usually in the search engine you would never use code like this: the sort keys will be way too large for no reason. For example usually its the case you will set something more reasonable like primary strength (case-insensitive).

        Because the clone is cheap, and this is a trap to users, I'm still going to fight for my original patch. This way the ICU and JDK functionality behave consistently from a user perspective.

        Show
        Robert Muir added a comment - Uwe, that is one alternative. The only reason i did it this way, is because I felt it was a bit of a trap (to any users using the Filter directly). This is because JDK collators are in fact thread safe. This is a slight difference, because the following code is always safe: new CollationFilter(Collator.newInstance(lang)), cloning would be wrong. I don't think this is really a reasonable example, usually in the search engine you would never use code like this: the sort keys will be way too large for no reason. For example usually its the case you will set something more reasonable like primary strength (case-insensitive). Because the clone is cheap, and this is a trap to users, I'm still going to fight for my original patch. This way the ICU and JDK functionality behave consistently from a user perspective.
        Hide
        Uwe Schindler added a comment -

        I would be fine with both solutions, for me it just looks wrong that way, but for safety reasons its fine.

        Show
        Uwe Schindler added a comment - I would be fine with both solutions, for me it just looks wrong that way, but for safety reasons its fine.
        Hide
        Robert Muir added a comment -

        Uwe, i can agree it looks a little wrong, but it makes the 'reusable' case easier.

        the example you gave is the slow non-reusable case... honestly i'm not very worried about making this slower... its already slow.

        if we are to put responsibility on the user to pass Collator clones to each TokenFilter, it will make reusing more difficult (e.g. custom analyzers).

        Again, the big trap is that usually you see "WARNING THIS CLASS IS NOT THREAD SAFE" but the icu javadoc doesn't really say that, you have to instead read this general architecture document... so I think by pushing the responsibility to the user, there would be lots of bugs (if anyone makes a custom analyzer/factory/etc).

        Show
        Robert Muir added a comment - Uwe, i can agree it looks a little wrong, but it makes the 'reusable' case easier. the example you gave is the slow non-reusable case... honestly i'm not very worried about making this slower... its already slow. if we are to put responsibility on the user to pass Collator clones to each TokenFilter, it will make reusing more difficult (e.g. custom analyzers). Again, the big trap is that usually you see "WARNING THIS CLASS IS NOT THREAD SAFE" but the icu javadoc doesn't really say that, you have to instead read this general architecture document... so I think by pushing the responsibility to the user, there would be lots of bugs (if anyone makes a custom analyzer/factory/etc).
        Hide
        Robert Muir added a comment -

        Updated patch: I clone the JDK ones too.

        Its not really the case that these are "thread-safe", instead really what happens is their methods are synced (if they are correct). So its good to clone to reduce contention.

        Also i took a look at harmony, which simply wraps the ICU impl and doesn't sync, which means its not thread safe (but should be).

        For these reasons I think we should just clone the JDK one for safety too.

        Show
        Robert Muir added a comment - Updated patch: I clone the JDK ones too. Its not really the case that these are "thread-safe", instead really what happens is their methods are synced (if they are correct). So its good to clone to reduce contention. Also i took a look at harmony, which simply wraps the ICU impl and doesn't sync, which means its not thread safe (but should be). For these reasons I think we should just clone the JDK one for safety too.
        Hide
        Robert Muir added a comment -

        i added a testThreadSafe for this issue (which fails without the fix)

        Show
        Robert Muir added a comment - i added a testThreadSafe for this issue (which fails without the fix)
        Hide
        Robert Muir added a comment -

        Committed revisions 1075850, 1076017 (branch_3x)

        Show
        Robert Muir added a comment - Committed revisions 1075850, 1076017 (branch_3x)
        Hide
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Show
        Grant Ingersoll added a comment - Bulk close for 3.1

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development