Lucene - Core: LUCENE-7429

DelegatingAnalyzerWrapper should delegate normalization too

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 6.2
    • Fix Version/s: master (7.0), 6.3
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is something that I overlooked in LUCENE-7355: (Delegating)AnalyzerWrapper uses the default implementation of initReaderForNormalization and normalize, meaning that by default the normalization is a no-op. It should delegate to the wrapped analyzer.
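As a rough sketch of the intended fix, using simplified stand-in types rather than the real Lucene classes: the wrapper should forward the normalization hooks (initReaderForNormalization, and normalize analogously) to the wrapped analyzer instead of inheriting the no-op defaults.

```java
import java.io.Reader;

// Simplified stand-ins for the real Lucene classes; the actual Analyzer
// and AnalyzerWrapper have many more responsibilities.
abstract class Analyzer {
    // Default: the normalization reader chain is a no-op pass-through.
    public Reader initReaderForNormalization(String fieldName, Reader reader) {
        return reader;
    }
}

abstract class AnalyzerWrapper extends Analyzer {
    protected abstract Analyzer getWrappedAnalyzer(String fieldName);

    // The fix: delegate normalization to the wrapped analyzer instead of
    // inheriting the no-op default. Analyzer.normalize would be forwarded
    // the same way.
    @Override
    public Reader initReaderForNormalization(String fieldName, Reader reader) {
        return getWrappedAnalyzer(fieldName).initReaderForNormalization(fieldName, reader);
    }
}
```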

      Attachments

      1. LUCENE-7355.patch
        6 kB
        Adrien Grand
      2. LUCENE-7429.patch
        13 kB
        Adrien Grand
      3. LUCENE-7429.patch
        13 kB
        Adrien Grand

        Activity

        jpountz Adrien Grand added a comment -

        Here is a patch.

        jpountz Adrien Grand added a comment -

        New patch that also makes sure to use the same attribute factory as the wrapped analyzer and adds more tests. I think it's ready.

        jpountz Adrien Grand added a comment -

        Uwe Schindler Would you mind having a look?

        thetaphi Uwe Schindler added a comment -

        I am trying to understand what's going on. The attributeFactory / fieldName part looks strange to me...

        jpountz Adrien Grand added a comment -

        This part was necessary because AnalyzerWrapper has Analyzer getWrappedAnalyzer(String fieldName). So in order to know which attribute factory to use, I had to add a fieldName parameter, so that AnalyzerWrapper.attributeFactory can be implemented as:

          @Override
          protected final AttributeFactory attributeFactory(String fieldName) {
            return getWrappedAnalyzer(fieldName).attributeFactory(fieldName);
          }
        

        Otherwise we could not figure out which analyzer to delegate to?

        thetaphi Uwe Schindler added a comment -

        I know why you did this, but it just looks wrong! IMHO, making the AttributeFactory part of the ctor of Tokenizers is not really a good idea in the first place, but some people like to have it! The AttributeFactory should be passed by the ctor of the subclass of a Tokenizer to super, if a Tokenizer needs a special one. But passing it around everywhere is not really useful.

        jpountz Adrien Grand added a comment -

        Agreed on that point. Since all we need for normalization is to use the same attribute factory as the regular analysis chain, would it work better for you if we tried to borrow the attribute source from a token stream created by Analyzer.tokenStream, like in the attached patch? (The relevant bits are the removal of Analyzer.attributeFactory() and how Analyzer.normalize assigns the attributeFactory variable.)
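        The "borrow the factory" idea can be sketched as follows with simplified stand-in types (these are not the real Lucene classes, and factoryForNormalization is an illustrative name, not an actual Lucene method): instead of adding an attributeFactory(fieldName) hook, normalization takes its attribute factory from a token stream created by the regular chain, so both chains produce the same attribute implementations.

```java
// Stand-in types only; the real Lucene TokenStream and AttributeFactory
// carry much more state.
class AttributeFactory { }

class TokenStream {
    private final AttributeFactory factory;
    TokenStream(AttributeFactory factory) { this.factory = factory; }
    AttributeFactory getAttributeFactory() { return factory; }
}

abstract class Analyzer {
    // The regular analysis chain; subclasses choose the attribute factory here.
    public abstract TokenStream tokenStream(String fieldName, String text);

    // Normalization borrows the attribute factory from a freshly created
    // token stream of the regular chain (the real code would also have to
    // close that stream, which is the overhead Uwe objects to below).
    public AttributeFactory factoryForNormalization(String fieldName) {
        TokenStream ts = tokenStream(fieldName, "");
        return ts.getAttributeFactory();
    }
}
```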

        thetaphi Uwe Schindler added a comment -

        Hi,
        I'd suggest using the previous patch. I don't like creating (and consuming) a TokenStream just to get its attribute source. This may be heavy, although that is unlikely for the kind of input that gets normalized.
        The issue here is mostly that we need to create a new TokenStream (StringTokenStream) for the normalization, and we need to use the same attribute types. This is sometimes broken for use-cases where TokenStreams create binary tokens, but those would never be normalized, I think (!?)
        Uwe

        jpountz Adrien Grand added a comment -

        The issue here is mostly that we need to create a new TokenStream (StringTokenStream) for the normalization and we need to use the same attribute types.

        Exactly. For instance, if a term attribute produces UTF-16 encoded tokens, the normalization chain needs to produce UTF-16 tokens too.

        Although this is sometimes broken for use-cases, where TokenStreams create binary tokens. But those would never be normalized, I think (!?)

        Do you mean that you cannot think of any use-case for using both a non-default term attribute and token filters in the same analysis chain? I am wondering about CJK analyzers: I think UTF-16 stores CJK characters a bit more efficiently on average than UTF-8 (I may be completely wrong, in which case please let me know), so users might be tempted to use a different term attribute impl?

        jpountz Adrien Grand added a comment -

        I would like to move forward on this issue with the patch that I proposed on August 29th if there are no objections.

        jira-bot ASF subversion and git services added a comment -

        Commit ed102d634a7aa48c1b995ba81548cc7454a467a9 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ed102d6 ]

        LUCENE-7429: AnalyzerWrapper can now wrap the normalization analysis chain too.

        jira-bot ASF subversion and git services added a comment -

        Commit af60048097a83220aae135b09d209a0f2d4ba3c6 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=af60048 ]

        LUCENE-7429: AnalyzerWrapper can now wrap the normalization analysis chain too.

        shalinmangar Shalin Shekhar Mangar added a comment -

        Closing after 6.3.0 release.


          People

          • Assignee: Unassigned
          • Reporter: jpountz Adrien Grand
          • Votes: 0
          • Watchers: 4
