Lucene - Core
LUCENE-3508

Decompounders based on CompoundWordTokenFilterBase cannot be used with custom attributes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4, 4.0-ALPHA
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      The CompoundWordTokenFilterBase.setToken method calls clearAttributes() and then resets only the default Token attributes (term, position, flags, etc.), so any custom attributes lose their value. Commenting out clearAttributes() seems to do the trick, but it makes the TestCompoundWordTokenFilter tests fail.
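      The failure mode can be sketched with a dependency-free model (this is not Lucene's actual code; the class, map-based attribute storage, and attribute names below are hypothetical stand-ins for Lucene's AttributeSource machinery):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for Lucene's AttributeSource: a token's attributes
// modeled as a name/value map. All names here are illustrative.
class TokenState {
    final Map<String, String> attrs = new HashMap<>();

    // Wipes EVERY attribute, custom ones included.
    void clearAttributes() {
        attrs.clear();
    }
}

public class SetTokenBug {
    // Mimics the problematic setToken(): clear everything, then restore
    // only the well-known default attributes.
    static void setToken(TokenState state, String term) {
        state.clearAttributes();
        state.attrs.put("term", term);     // default attributes are restored...
        state.attrs.put("posIncr", "1");
        // ...but any custom attribute set earlier in the chain is now gone.
    }

    public static void main(String[] args) {
        TokenState state = new TokenState();
        state.attrs.put("myCustomAttr", "payload"); // set by an upstream filter
        setToken(state, "donau");
        System.out.println(state.attrs.get("term"));         // donau
        System.out.println(state.attrs.get("myCustomAttr")); // null -> value lost
    }
}
```

      Skipping the clearAttributes() call would keep the custom value, which is why commenting it out appears to fix the problem while breaking the tests that rely on the default attributes being reset.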

      1. LUCENE-3508.patch
        32 kB
        Uwe Schindler
      2. LUCENE-3508.patch
        22 kB
        Uwe Schindler
      3. LUCENE-3508.patch
        22 kB
        Uwe Schindler
      4. LUCENE-3508.patch
        20 kB
        Uwe Schindler
      5. LUCENE-3508.patch
        16 kB
        Spyros Kapnissis

        Activity

        Spyros Kapnissis added a comment -

        I am attaching a patch that uses the TokenStream API instead of Token for the decompounders (CompoundWordTokenFilterBase, DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter). Non-common attributes are now passed on correctly through the analyzer chain. A unit test of the functionality is included.

        Hope it helps; any improvements are of course welcome.

        Uwe Schindler added a comment -

        Hi Spyros,
        thanks for taking care. The patch looks correct, I have some minor recommendations (I will post them later).
        I will report back soon.

        Robert Muir added a comment -

        I agree, let's fix this!

        Are there any other tokenstreams that don't yet work correctly with the attributes API?
        I was surprised there was no warning in the javadocs for this one.
        I thought I remembered us adding a warning, maybe that was ShingleMatrix though.

        Uwe Schindler added a comment -

        ShingleMatrix is gone in 4.0 and deprecated in 3.x. There may be other filters with the same problem, maybe we should look for uses of Token class (which MUST DIE DIE DIE!) in the analyzers.

        Uwe Schindler added a comment -

        Attached you will find a new patch for trunk. I made some improvements to the copy operations and the CompoundToken class:

        • copy operations no longer create useless String objects or clones of String's internal char[] (which slowed down indexing a lot)
        • the algorithmic hyphenator uses CTA's char[] directly as it did for Token before (see above) and uses optimized append()
        • the broken non-Unicode-conformant lowercasing was removed; instead, the CharArraySet is created case insensitive. If you pass in your own CharArraySet, it has to be case insensitive; if not, the filter will fail (what to do? Robert, how do we handle that otherwise?)
        • As all tokens are again CTAs, the CAS lookup is fast again.
        • some whitespace cleanup in the test and relics in the base source file (Lucene requires 2 spaces, no tabs)

        Robert, if you could look into it, it would be great. I did not test it with Solr, but for me it looks correct.

        Uwe

        Uwe Schindler added a comment -

        More cleanup:

        • As the original token is always preserved, it is not put into the list at all and is returned without modification (no extra copy operations)
        • removed deprecated makeDictionary method and corrected matchVersion usage.
        Uwe Schindler added a comment -

        One more time the filter was revisited and partly rewritten:

        • it no longer clones the original token, as decompounding is done while the TokenStream is positioned on this token, which does not change. The decompounder simply takes termAtt/offsetAtt and produces new CompoundToken instances out of them, added to the LinkedList. The original is returned unmodified by a simple "return true". The filter now only creates new objects when compounds are found; all other tokens are passed through as-is.
        • CompoundToken is now a simple wrapper around the characters and the offsets, copied out of the unmodified original termAtt.

        I think that's the most efficient implementation of this filter. I think it's ready to commit.
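        The design described in this comment can be sketched with a dependency-free model (class and method names below are illustrative, not Lucene's real TokenFilter API):

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Sketch of the final design: the current token passes through unchanged,
// and CompoundToken objects are only allocated when the dictionary
// actually matches subwords. These are stand-in classes, not Lucene's.
public class DecompounderSketch {
    // Simple value holder copied out of the (unmodified) original term,
    // mirroring the CompoundToken wrapper described above.
    static class CompoundToken {
        final String text;
        final int startOffset, endOffset;
        CompoundToken(String text, int start, int end) {
            this.text = text; this.startOffset = start; this.endOffset = end;
        }
    }

    private final List<String> dictionary;
    private final LinkedList<CompoundToken> tokens = new LinkedList<>();

    DecompounderSketch(List<String> dictionary) { this.dictionary = dictionary; }

    // Returns the original term unchanged; subword tokens, if any,
    // are queued in the LinkedList for later emission.
    String process(String term, int startOffset) {
        for (String sub : dictionary) {
            int idx = term.indexOf(sub);
            if (idx >= 0) {
                tokens.add(new CompoundToken(sub, startOffset + idx,
                        startOffset + idx + sub.length()));
            }
        }
        return term; // the original token passes through as-is
    }

    LinkedList<CompoundToken> queued() { return tokens; }

    public static void main(String[] args) {
        DecompounderSketch d =
            new DecompounderSketch(Arrays.asList("donau", "schiff"));
        System.out.println(d.process("donauschiff", 0)); // donauschiff
        for (CompoundToken t : d.queued()) {
            System.out.println(t.text + " " + t.startOffset + "-" + t.endOffset);
        }
    }
}
```

        Tokens that contain no dictionary subwords never touch the queue, so no new objects are created for them, which is the point of the rewrite.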

        Robert Muir added a comment -

        Just one idea: if the base has makeDictionary(String[]), then maybe deprecate-3x-remove-trunk the stupid string[] ctors and just take the CharArraySet?

        I think this would remove about half the ctors in both base and subclasses, and I think these ctors are stupid myself.

        Otherwise, looks great!

        Uwe Schindler added a comment -

        Robert: I agree. I think I will do this in a separate issue, to get these changes into 3.x first (by merging) and then deprecate/remove all this later. Do we have other StopFilters and similar classes that have those String[] ctors?

        I will also add a javadocs info to the ctor: "This filter does many lookups on the dictionary, so it is recommended for optimal performance to add a LowercaseFilter before this filter and make the CharArraySet passed as dictionary case insensitive."

        I would like to add this as a hint, as this filter does lots of lookups in the Set per term.
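        The effect the hint is guarding against can be shown with plain JDK collections (TreeSet with a case-insensitive comparator stands in for Lucene's CharArraySet here; this is an illustration, not the real class):

```java
import java.util.Set;
import java.util.TreeSet;

// Demonstrates why a case-insensitive dictionary matters: with a
// case-sensitive set, a lowercased input term misses a capitalized
// dictionary entry on every per-term lookup.
public class CaseInsensitiveDict {
    public static void main(String[] args) {
        Set<String> caseSensitive = new TreeSet<>();
        caseSensitive.add("Donau");

        Set<String> caseInsensitive = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        caseInsensitive.add("Donau");

        System.out.println(caseSensitive.contains("donau"));   // false -> lookup misses
        System.out.println(caseInsensitive.contains("donau")); // true
    }
}
```

        With a LowercaseFilter earlier in the chain and a case-insensitive dictionary, every lookup is a direct hit without any per-lookup case folding.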

        Uwe Schindler added a comment -

        Final trunk patch, which gets committed now:

        • Changed the CompoundToken class to use CharSequence for holding the slices.
        • Added javadocs with recommendations for performance.
        • Deprecated all String[] methods.

        Once this is committed, I will backport and then in a second step remove the deprecated methods in trunk.

        Uwe Schindler added a comment -

        Committed trunk revision: 1188597

        Uwe Schindler added a comment -

        Committed 3.x revision: 1188604

        Uwe Schindler added a comment -

        Finally also removed deprecations, so this issue is resolved.

        Thanks Spyros for reporting this and making the initial patch!

        Spyros Kapnissis added a comment -

        Looks great now, thanks a lot guys!

        Uwe Schindler added a comment -

        Bulk close after release of 3.5


          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Spyros Kapnissis
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue
