Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      TeeSinkTokenFilter can be improved in several ways, as it's written today:

      The most major one is removing SinkFilter which just doesn't work and is confusing. E.g., if you set a SinkFilter which filters tokens, the attributes on the stream such as PositionIncrementAttribute are not updated. Also, if you update any attribute on the stream, you affect other SinkStreams ... It's best if we remove this confusing class, and let consumers reuse existing TokenFilters by chaining them to the sink stream.

      After we do that, we can make all the cached states a single (immutable) list, which is shared between all the sink streams, so we don't need to keep many references around, and also deal with WeakReference.

      Besides that there are some other minor improvements to the code that will come after we clean up this class.

      From a backwards-compatibility standpoint, I don't think that SinkFilter is actually used anywhere (since it just ... confusing and doesn't work as expected), and therefore I believe it won't affect anyone. If however someone did implement a SinkFilter, it should be trivial to convert it to a TokenFilter and chain it to the SinkStream.

      1. LUCENE-6973.patch
        51 kB
        Shai Erera
      2. LUCENE-6973.patch
        49 kB
        Shai Erera
      3. LUCENE-6973.patch
        48 kB
        Shai Erera
      4. LUCENE-6973.patch
        63 kB
        Shai Erera
      5. LUCENE-6973.patch
        64 kB
        Shai Erera
      6. LUCENE-6973.patch
        64 kB
        Shai Erera
      7. LUCENE-6973.patch
        65 kB
        Shai Erera
      8. LUCENE-6973.patch
        65 kB
        Shai Erera

        Activity

        Hide
        Paul Elschot added a comment -

        At LUCENE-5687 there is a PrefillTokenStream class that is basically a superclass extracted from TeeSinkTokenFilter.
        I'm using PrefillTokenStream objects as targets for splitting a token stream into multiple streams.

        Iirc tried to use SinkFilter at first, but it was easier to extract the superclass and avoid SinkFilter.
        So I would not mind losing SinkFilter.

        Show
        Paul Elschot added a comment - At LUCENE-5687 there is a PrefillTokenStream class that is basically a superclass extracted from TeeSinkTokenFilter. I'm using PrefillTokenStream objects as targets for splitting a token stream into multiple streams. Iirc tried to use SinkFilter at first, but it was easier to extract the superclass and avoid SinkFilter. So I would not mind losing SinkFilter.
        Hide
        Paul Elschot added a comment -

        After another look at the code, I think PrefillTokenStream could be a nice starting point to improve TeeSinkTokenFilter.

        Show
        Paul Elschot added a comment - After another look at the code, I think PrefillTokenStream could be a nice starting point to improve TeeSinkTokenFilter.
        Hide
        Uwe Schindler added a comment -

        Hi, I agree with Shai's proposal! So +1 to fix this.

        Show
        Uwe Schindler added a comment - Hi, I agree with Shai's proposal! So +1 to fix this.
        Hide
        Shai Erera added a comment -

        Patch implements the proposed changes.

        Show
        Shai Erera added a comment - Patch implements the proposed changes.
        Hide
        Paul Elschot added a comment -

        The patch has changes to DateRecognizerFilter.java, but I don't see that in trunk.
        There is this in trunk: lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/DateRecognizerSinkFilter.java

        For which branch is the patch?

        Show
        Paul Elschot added a comment - The patch has changes to DateRecognizerFilter.java, but I don't see that in trunk. There is this in trunk: lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/DateRecognizerSinkFilter.java For which branch is the patch?
        Hide
        Shai Erera added a comment -

        This patch is for trunk. I renamed DateRecognizerSinkFilter to DateRacognizerFilter. When I do svn stat I get this:

        A  +    lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilter.java
                > moved from lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/DateRecognizerSinkFilter.java
        
        Show
        Shai Erera added a comment - This patch is for trunk. I renamed DateRecognizerSinkFilter to DateRacognizerFilter . When I do svn stat I get this: A + lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilter.java > moved from lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/DateRecognizerSinkFilter.java
        Hide
        Uwe Schindler added a comment -

        When you execute svn diff, add the option to show renames as adds: --show-copies-as-adds

        Show
        Uwe Schindler added a comment - When you execute svn diff, add the option to show renames as adds: --show-copies-as-adds
        Hide
        Shai Erera added a comment -

        Patch w/ --show-copies-as-adds

        Show
        Shai Erera added a comment - Patch w/ --show-copies-as-adds
        Hide
        Paul Elschot added a comment -

        Ok, somehow I can't cleanly apply the patch, neither with patch, nor with git, nor with svn.

        But as AFAICT in TeeSinkTokenFilter this uses a ArrayDeque instead of an ArrayList and it drops the accept method. That was in the way for PrefillTokenStream too, which is why I needed to extract the superclass. I'm fine with dropping that method.

        I get the best result with svn, which gives this output::

        svn patch ~/patches/LUCENE-6973.patch
        U         lucene/CHANGES.txt
        Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilter.java'
        A         lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilterFactory.java
        Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenRangeFilter.java'
        A         lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenRangeFilterFactory.java
        Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenTypeFilter.java'
        A         lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenTypeFilterFactory.java
        D         lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/DateRecognizerSinkFilter.java
        U         lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.java
        D         lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TokenRangeSinkFilter.java
        D         lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TokenTypeSinkFilter.java
        U         lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/package-info.java
        U         lucene/analysis/common/src/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
        Skipped missing target: 'lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/DateRecognizerTokenFilterTest.java'
        Skipped missing target: 'lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TokenRangeFilterTest.java'
        Skipped missing target: 'lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TokenTypeFilterTest.java'
        D         lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/DateRecognizerSinkTokenizerTest.java
        U         lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TestTeeSinkTokenFilter.java
        D         lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TokenRangeSinkTokenizerTest.java
        D         lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TokenTypeSinkTokenizerTest.java
        U         lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java
        Summary of conflicts:
          Skipped paths: 6
        

        I'm probably doing something wrong here, but please go ahead as you see fit.

        Show
        Paul Elschot added a comment - Ok, somehow I can't cleanly apply the patch, neither with patch, nor with git, nor with svn. But as AFAICT in TeeSinkTokenFilter this uses a ArrayDeque instead of an ArrayList and it drops the accept method. That was in the way for PrefillTokenStream too, which is why I needed to extract the superclass. I'm fine with dropping that method. I get the best result with svn, which gives this output:: svn patch ~/patches/LUCENE-6973.patch U lucene/CHANGES.txt Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilter.java' A lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilterFactory.java Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenRangeFilter.java' A lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenRangeFilterFactory.java Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenTypeFilter.java' A lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenTypeFilterFactory.java D lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/DateRecognizerSinkFilter.java U lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.java D lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TokenRangeSinkFilter.java D lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TokenTypeSinkFilter.java U lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/package-info.java U lucene/analysis/common/src/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory Skipped missing target: 'lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/DateRecognizerTokenFilterTest.java' Skipped missing target: 'lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TokenRangeFilterTest.java' Skipped missing target: 'lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TokenTypeFilterTest.java' D lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/DateRecognizerSinkTokenizerTest.java U lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TestTeeSinkTokenFilter.java D lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TokenRangeSinkTokenizerTest.java D lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TokenTypeSinkTokenizerTest.java U lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java Summary of conflicts: Skipped paths: 6 I'm probably doing something wrong here, but please go ahead as you see fit.
        Hide
        Paul Elschot added a comment -

        With the 2nd patch I get this output:

        svn patch ~/patches/LUCENE-6973b.patch
        U         lucene/CHANGES.txt
        A         lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilter.java
        Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilterFactory.java'
        A         lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenRangeFilter.java
        Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenRangeFilterFactory.java'
        A         lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenTypeFilter.java
        Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenTypeFilterFactory.java'
        D         lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/DateRecognizerSinkFilter.java
        U         lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.java
        D         lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TokenRangeSinkFilter.java
        D         lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TokenTypeSinkFilter.java
        U         lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/package-info.java
        U         lucene/analysis/common/src/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
        A         lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/DateRecognizerTokenFilterTest.java
        A         lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TokenRangeFilterTest.java
        A         lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TokenTypeFilterTest.java
        D         lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/DateRecognizerSinkTokenizerTest.java
        U         lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TestTeeSinkTokenFilter.java
        D         lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TokenRangeSinkTokenizerTest.java
        D         lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TokenTypeSinkTokenizerTest.java
        U         lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java
        Summary of conflicts:
          Skipped paths: 3
        

        That's only half of the earlier skipped paths.
        But I'm only interested in TeeSinkTokenFilter, and that is definitely improved.

        Show
        Paul Elschot added a comment - With the 2nd patch I get this output: svn patch ~/patches/LUCENE-6973b.patch U lucene/CHANGES.txt A lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilter.java Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilterFactory.java' A lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenRangeFilter.java Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenRangeFilterFactory.java' A lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenTypeFilter.java Skipped missing target: 'lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/TokenTypeFilterFactory.java' D lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/DateRecognizerSinkFilter.java U lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.java D lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TokenRangeSinkFilter.java D lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/TokenTypeSinkFilter.java U lucene/analysis/common/src/java/org/apache/lucene/analysis/sinks/package-info.java U lucene/analysis/common/src/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory A lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/DateRecognizerTokenFilterTest.java A lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TokenRangeFilterTest.java A lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TokenTypeFilterTest.java D lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/DateRecognizerSinkTokenizerTest.java U lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TestTeeSinkTokenFilter.java D lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TokenRangeSinkTokenizerTest.java D lucene/analysis/common/src/test/org/apache/lucene/analysis/sinks/TokenTypeSinkTokenizerTest.java U lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java Summary of conflicts: Skipped paths: 3 That's only half of the earlier skipped paths. But I'm only interested in TeeSinkTokenFilter, and that is definitely improved.
        Hide
        Paul Elschot added a comment -

        Btw. in case there is preference for ArrayDeque over ArrayList, CachingTokenFilter could also be changed for that.
        See also LUCENE-6033, first patch.

        Show
        Paul Elschot added a comment - Btw. in case there is preference for ArrayDeque over ArrayList, CachingTokenFilter could also be changed for that. See also LUCENE-6033 , first patch.
        Hide
        Uwe Schindler added a comment -

        Btw. in case there is preference for ArrayDeque over ArrayList, CachingTokenFilter could also be changed for that.

        I don't think we need a Deque here or there. We only add elements at the end, so ArrayList is fine, also for the States here. ArrayDeque is only better as replacement for a LinkedList, where you want to remove or insert elements at start of list (where a LinkedList is fine, but ArrayList involves copying on each insert/delete). But LinkedList was also wrong for the use-case here. We just add elements at the end and it gets unmodifiable at end. The iterator is happy, too. ArrayList is also slightly faster because it does not need to check 2 bounds.

        Shai: The Javadocs of the SinkFilter replacements are partly misleading (they still talk about sinks, although it is no longer related to TeeSink).

        I will further look into this patch tomorrow, this was just my first thoughts.

        Show
        Uwe Schindler added a comment - Btw. in case there is preference for ArrayDeque over ArrayList, CachingTokenFilter could also be changed for that. I don't think we need a Deque here or there. We only add elements at the end, so ArrayList is fine, also for the States here. ArrayDeque is only better as replacement for a LinkedList, where you want to remove or insert elements at start of list (where a LinkedList is fine, but ArrayList involves copying on each insert/delete). But LinkedList was also wrong for the use-case here. We just add elements at the end and it gets unmodifiable at end. The iterator is happy, too. ArrayList is also slightly faster because it does not need to check 2 bounds. Shai: The Javadocs of the SinkFilter replacements are partly misleading (they still talk about sinks, although it is no longer related to TeeSink). I will further look into this patch tomorrow, this was just my first thoughts.
        Hide
        Shai Erera added a comment -

        Patch with following changes:

        • Change from Deque to List. Uwe, I thought that in our discussions you mentioned that you prefer ArrayDeque, even though I agree that in this case, since we don't remove any elements, I don't think it has benefits over simple ArrayList).
        • Change the new filters unit tests to not use TeeSinkTokenFilter at all, as it's unrelated anymore.

        Uwe, besides that, I didn't find any javadocs that refer to SinkFilter, can you please cite an example?

        Show
        Shai Erera added a comment - Patch with following changes: Change from Deque to List. Uwe, I thought that in our discussions you mentioned that you prefer ArrayDeque, even though I agree that in this case, since we don't remove any elements, I don't think it has benefits over simple ArrayList). Change the new filters unit tests to not use TeeSinkTokenFilter at all, as it's unrelated anymore. Uwe, besides that, I didn't find any javadocs that refer to SinkFilter, can you please cite an example?
        Hide
        Paul Elschot added a comment -

        2nd and 3rd patch apply cleanly now, before applying the 2nd patch I had forgotten to remove the earlier added files that were left by

        svn revert -R .

        PrefillTokenStream of LUCENE-5687 has a different constructor argument (AttributeSource), so it is not easily replaced by the TeeSinkTokenFilter here.
        I'll deal with that later there.

        Show
        Paul Elschot added a comment - 2nd and 3rd patch apply cleanly now, before applying the 2nd patch I had forgotten to remove the earlier added files that were left by svn revert -R . PrefillTokenStream of LUCENE-5687 has a different constructor argument (AttributeSource), so it is not easily replaced by the TeeSinkTokenFilter here. I'll deal with that later there.
        Hide
        Shai Erera added a comment -

        I missed the mention of 'sink' in DateRecognizerFilter.

        Show
        Shai Erera added a comment - I missed the mention of 'sink' in DateRecognizerFilter .
        Hide
        Shai Erera added a comment -

        I think that's ready for commit. Uwe Schindler if you don't have any more comments, I'll go ahead.

        Show
        Shai Erera added a comment - I think that's ready for commit. Uwe Schindler if you don't have any more comments, I'll go ahead.
        Hide
        Uwe Schindler added a comment -

        STOP: I think you added some missing TokenFilters to wrong SPI file! lucene/analysis/common/src/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory

        Should be in its own JAR file for every module.

        Show
        Uwe Schindler added a comment - STOP: I think you added some missing TokenFilters to wrong SPI file! lucene/analysis/common/src/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory Should be in its own JAR file for every module.
        Hide
        Uwe Schindler added a comment -

        Can you simple revert the addition of stempel (which is already there in its own JAR file) and phonetic filters. They are not part of analysis-common!

        The test fails currently on Eclipse, because classpath is setup in wrong way.

        Show
        Uwe Schindler added a comment - Can you simple revert the addition of stempel (which is already there in its own JAR file) and phonetic filters. They are not part of analysis-common! The test fails currently on Eclipse, because classpath is setup in wrong way.
        Hide
        Uwe Schindler added a comment -

        org.apache.lucene.analysis.phonetic.DaitchMokotoffSoundexFilterFactory is missing in phonetic's SPI file, but thats a separate issue. Please open issue.

        Show
        Uwe Schindler added a comment - org.apache.lucene.analysis.phonetic.DaitchMokotoffSoundexFilterFactory is missing in phonetic's SPI file, but thats a separate issue. Please open issue.
        Hide
        Shai Erera added a comment -

        Good catch Uwe Schindler! Removed.

        Show
        Shai Erera added a comment - Good catch Uwe Schindler ! Removed.
        Hide
        Uwe Schindler added a comment -

        OK looks fine. I did not run tests, I wonder why the tests of CustomAnalyzer or similar did not catch the non-existent classes.

        Show
        Uwe Schindler added a comment - OK looks fine. I did not run tests, I wonder why the tests of CustomAnalyzer or similar did not catch the non-existent classes.
        Hide
        Shai Erera added a comment -

        I ran the tests and TestRandomChains fails with this:

           [junit4]   2> NOTE: Windows 7 6.1 amd64/Oracle Corporation 1.8.0_40 (64-bit)/cpus=8,threads=1,free=393306544,total=510656512
           [junit4]   2> NOTE: All tests run in this JVM: [TestRandomChains]
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.seed=5FF882C20C905C54 -Dtests.slow=true -Dtests.locale=cs -Dtests.timezone=America/Buenos_Aires -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
           [junit4] ERROR   0.00s | TestRandomChains (suite) <<<
           [junit4]    > Throwable #1: java.lang.AssertionError: public org.apache.lucene.analysis.miscellaneous.DateRecognizerFilter(org.apache.lucene.analysis.TokenStream,java.text.DateFormat) has unsupported parameter types
           [junit4]    >        at __randomizedtesting.SeedInfo.seed([5FF882C20C905C54]:0)
           [junit4]    >        at org.apache.lucene.analysis.core.TestRandomChains.beforeClass(TestRandomChains.java:233)
           [junit4]    >        at java.lang.Thread.run(Thread.java:745)
        

        I tracked it to argProducers not having DataFormat defined. Is it OK to add it?

        Show
        Shai Erera added a comment - I ran the tests and TestRandomChains fails with this: [junit4] 2> NOTE: Windows 7 6.1 amd64/Oracle Corporation 1.8.0_40 (64-bit)/cpus=8,threads=1,free=393306544,total=510656512 [junit4] 2> NOTE: All tests run in this JVM: [TestRandomChains] [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.seed=5FF882C20C905C54 -Dtests.slow=true -Dtests.locale=cs -Dtests.timezone=America/Buenos_Aires -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 [junit4] ERROR 0.00s | TestRandomChains (suite) <<< [junit4] > Throwable #1: java.lang.AssertionError: public org.apache.lucene.analysis.miscellaneous.DateRecognizerFilter(org.apache.lucene.analysis.TokenStream,java.text.DateFormat) has unsupported parameter types [junit4] > at __randomizedtesting.SeedInfo.seed([5FF882C20C905C54]:0) [junit4] > at org.apache.lucene.analysis.core.TestRandomChains.beforeClass(TestRandomChains.java:233) [junit4] > at java.lang.Thread.run(Thread.java:745) I tracked it to argProducers not having DataFormat defined. Is it OK to add it?
        Hide
        Shai Erera added a comment -

        Patch addresses the following:

        • Removes TokenTypeFilter (I found TypeTokenFilter which does the same and more).
        • Removes TokenRangeFilter: the old one (Sink) had a bug IMO and in general I don't find this filter useful. It doesn't take into account other filters which drop tokens, so if you pass the range 3,5 it's not clear if you expect the original 3-5 terms, or any terms numbered 3-5. Anyway, I think it's trivial to implement if someone really needs such a filter, we don't have to offer it out-of-the-box.
        • Added an argProducer to TestRandomChains.

        Uwe Schindler I think it's ready. Would appreciate a final review though.

        Show
        Shai Erera added a comment - Patch addresses the following: Removes TokenTypeFilter (I found TypeTokenFilter which does the same and more). Removes TokenRangeFilter : the old one (Sink) had a bug IMO and in general I don't find this filter useful. It doesn't take into account other filters which drop tokens, so if you pass the range 3,5 it's not clear if you expect the original 3-5 terms, or any terms numbered 3-5. Anyway, I think it's trivial to implement if someone really needs such a filter, we don't have to offer it out-of-the-box. Added an argProducer to TestRandomChains . Uwe Schindler I think it's ready. Would appreciate a final review though.
        Hide
        Uwe Schindler added a comment -

        Yes, the only way to do this. I think we have a similar one already for some other date-related stuff.

        Show
        Uwe Schindler added a comment - Yes, the only way to do this. I think we have a similar one already for some other date-related stuff.
        Hide
        Uwe Schindler added a comment - - edited

        Oh missed last comment. Yes will do final review. I agree with the removal. Maybe make a API hint in CHANGES.txt to direct users to replacements for SinkFilters.

        Show
        Uwe Schindler added a comment - - edited Oh missed last comment. Yes will do final review. I agree with the removal. Maybe make a API hint in CHANGES.txt to direct users to replacements for SinkFilters.
        Hide
        Uwe Schindler added a comment - - edited

        The FilterFactory for the data detection should also list a locale:
        this.dateFormat = datePattern != null ? new SimpleDateFormat(datePattern, Locale.ROOT) : null; is not useful for custom formatted dates. E.g. the root locale has no month names in CLDR (which happened in Java 9), only "Month 1"

        So The factory should also take a locale (we have that at other places, too).

        The default in the filter should better use Locale.ENGLISH, otherwise it will likely break with Java 9.

        Show
        Uwe Schindler added a comment - - edited The FilterFactory for the data detection should also list a locale: this.dateFormat = datePattern != null ? new SimpleDateFormat(datePattern, Locale.ROOT) : null; is not useful for custom formatted dates. E.g. the root locale has no month names in CLDR (which happened in Java 9), only "Month 1" So The factory should also take a locale (we have that at other places, too). The default in the filter should better use Locale.ENGLISH, otherwise it will likely break with Java 9.
        Hide
        Shai Erera added a comment -

        Thanks Uwe Schindler for the review. I added some notes to CHANGES as well Locale support to the factory.

        Show
        Shai Erera added a comment - Thanks Uwe Schindler for the review. I added some notes to CHANGES as well Locale support to the factory.
        Hide
        Uwe Schindler added a comment -

        Hi,

        still Javadocs of the default are wrong, it should say that it uses english language to detect date formats by default. Also there is Locale.ROOT used in one example with month names.

        Currently, I am not sure if forLanguageTag() is a good idea, because this one may not accept the usual formatting with underscore, like "en_US" - haven't tried (there is also no test of factory). In LuceneTestCase we use a handwritten method static Locale localeForName(String localeName). But I am not 100% sure if this is just a relict from pre-Java-7 days. Maybe Robert Muir can comment on this. If forLanguageTag() is fine, we could also use it in LTC.

        Show
        Uwe Schindler added a comment - Hi, still Javadocs of the default are wrong, it should say that it uses english language to detect date formats by default. Also there is Locale.ROOT used in one example with month names. Currently, I am not sure if forLanguageTag() is a good idea, because this one may not accept the usual formatting with underscore, like "en_US" - haven't tried (there is also no test of factory). In LuceneTestCase we use a handwritten method static Locale localeForName(String localeName) . But I am not 100% sure if this is just a relict from pre-Java-7 days. Maybe Robert Muir can comment on this. If forLanguageTag() is fine, we could also use it in LTC.
        Hide
        Uwe Schindler added a comment -

        Robert had a comment on: LUCENE-4021

        Show
        Uwe Schindler added a comment - Robert had a comment on: LUCENE-4021
        Hide
        Uwe Schindler added a comment -

        I tried it: forLanguage tag does not accept "en_US", it returns the root locale instead!!! Not even throwing Exception! (see javadocs). So we should maybe add the LTC method into a Utility class. It works with "en-US", but people would not use that. So Java 7 is still missing a method to construct a Locale from their toString() output.

        Show
        Uwe Schindler added a comment - I tried it: forLanguage tag does not accept "en_US", it returns the root locale instead!!! Not even throwing Exception! (see javadocs). So we should maybe add the LTC method into a Utility class. It works with "en-US", but people would not use that. So Java 7 is still missing a method to construct a Locale from their toString() output.
        Hide
        Robert Muir added a comment -

        Try this:

        If the specified language tag contains any ill-formed subtags, the first such subtag and all following subtags are ignored. Compare to Locale.Builder.setLanguageTag(java.lang.String) which throws an exception in this case.

        Show
        Robert Muir added a comment - Try this: If the specified language tag contains any ill-formed subtags, the first such subtag and all following subtags are ignored. Compare to Locale.Builder.setLanguageTag(java.lang.String) which throws an exception in this case.
        Hide
        Robert Muir added a comment -

        and i dont know if that will fix your en_US problem, i think not, but it should be less lenient at least.

        in general i prefer language tags rather than a ton of locale options. And it has the ability to pass some super-expert stuff just via one parameter (e.g. alternate calendar). but we may need to add some checks to prevent mistakes like what you show.

        Show
        Robert Muir added a comment - and i dont know if that will fix your en_US problem, i think not, but it should be less lenient at least. in general i prefer language tags rather than a ton of locale options. And it has the ability to pass some super-expert stuff just via one parameter (e.g. alternate calendar). but we may need to add some checks to prevent mistakes like what you show.
        Hide
        Uwe Schindler added a comment -

        I also agree that language tag (because they are "standardized") would better match. For LTC there is no backwards problem, and the one here is new. But we should also review Solr code where it tries to parse Locale names.

        So I think we should also open a separate issue:

        • let LTC print the language tag instead of toString() on failures
        • make LTC parse the input locale using the builder that throws exception.

        For the current factory, we should use the non-lenient variant with Locale.Builder. Because resolving "en_US" to the root locale is a no-go.

        Show
        Uwe Schindler added a comment - I also agree that language tag (because they are "standardized") would better match. For LTC there is no backwards problem, and the one here is new. But we should also review Solr code where it tries to parse Locale names. So I think we should also open a separate issue: let LTC print the language tag instead of toString() on failures make LTC parse the input locale using the builder that throws exception. For the current factory, we should use the non-lenient variant with Locale.Builder. Because resolving "en_US" to the root locale is a no-go.
        Hide
        Uwe Schindler added a comment - - edited

        I checked:

        new Locale.Builder().setLanguageTag("en_US").build()
        

        This works as expected. On "en_US" it throws: java.util.IllformedLocaleException: Invalid subtag: en_US [at index 0]
        "en-us" or "en-US" works as expected.

        Show
        Uwe Schindler added a comment - - edited I checked: new Locale.Builder().setLanguageTag( "en_US" ).build() This works as expected. On "en_US" it throws: java.util.IllformedLocaleException: Invalid subtag: en_US [at index 0] "en-us" or "en-US" works as expected.
        Hide
        Shai Erera added a comment -

        Thanks Uwe Schindler and Robert Muir. Patch adds DateRecognizerFilterFactoryTest and moves to use Locale.Builder. I also fixed the javadocs to reflect that.

        Show
        Shai Erera added a comment - Thanks Uwe Schindler and Robert Muir . Patch adds DateRecognizerFilterFactoryTest and moves to use Locale.Builder . I also fixed the javadocs to reflect that.
        Hide
        Uwe Schindler added a comment -

        +1 to commit, only a small change: Use LTC.getRandomLocale(Random) in the ArgProducer.

        Show
        Uwe Schindler added a comment - +1 to commit, only a small change: Use LTC.getRandomLocale(Random) in the ArgProducer.
        Hide
        ASF subversion and git services added a comment -

        Commit 1724789 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1724789 ]

        LUCENE-6973: Improve TeeSinkTokenFilter

        Show
        ASF subversion and git services added a comment - Commit 1724789 from Shai Erera in branch 'dev/trunk' [ https://svn.apache.org/r1724789 ] LUCENE-6973 : Improve TeeSinkTokenFilter
        Hide
        ASF subversion and git services added a comment -

        Commit 1724795 from Shai Erera in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1724795 ]

        LUCENE-6973: Improve TeeSinkTokenFilter

        Show
        ASF subversion and git services added a comment - Commit 1724795 from Shai Erera in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1724795 ] LUCENE-6973 : Improve TeeSinkTokenFilter
        Hide
        Shai Erera added a comment -

        Thanks Uwe Schindler, I addressed your last comment in the commit.

        Show
        Shai Erera added a comment - Thanks Uwe Schindler , I addressed your last comment in the commit.
        Hide
        Uwe Schindler added a comment -

        I opened LUCENE-6978 about improving LTC.

        Show
        Uwe Schindler added a comment - I opened LUCENE-6978 about improving LTC.

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development