Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-1804

Can't specify AttributeSource for Tokenizer

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      One can't currently specify the attribute source for a Tokenizer like one can with any other TokenStream.

      1. LUCENE-1804.patch
        0.8 kB
        Yonik Seeley

        Activity

        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Committed.

        I'm not sure it's worth adding constructors for all combinations of parameters, esp when the trend is toward reuse, and specifying the reader separately - but I think that can be a different issue (whether to remove some of the existing constructors or not).

        Show
        yseeley@gmail.com Yonik Seeley added a comment - Committed. I'm not sure it's worth adding constructors for all combinations of parameters, esp when the trend is toward reuse, and specifying the reader separately - but I think that can be a different issue (whether to remove some of the existing constructors or not).
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        OO design principal of not removing functionality - Tokenizer's superclass can specify it's AttributeSource... why can't Tokenizer? We shouldn't disallow it just because we can't immediately think of a use case.

        I am still not sure, why a simple TokenFilter does not serve the same pupose you would like to have with Tokenizer here.

        Simplest case: a Tokenizer that delegates to an existing Tokenizer or TokenStream?

        Show
        yseeley@gmail.com Yonik Seeley added a comment - OO design principal of not removing functionality - Tokenizer's superclass can specify it's AttributeSource... why can't Tokenizer? We shouldn't disallow it just because we can't immediately think of a use case. I am still not sure, why a simple TokenFilter does not serve the same pupose you would like to have with Tokenizer here. Simplest case: a Tokenizer that delegates to an existing Tokenizer or TokenStream?
        Hide
        thetaphi Uwe Schindler added a comment -

        Normally it would be ok. E.g. in the reuse of TokenStreams, the simpliest would be to create the tokenizer with a null Reader first and only reset(Reader) it before first use. I think, this has historical reasons and to keep consistent we should add the ctors. Or deprecate all Reader ctors and state, that you should create a reusable Tokenizer and call reset(Reader).

        I am still not sure, why a simple TokenFilter does not serve the same pupose you would like to have with Tokenizer here. Why not simply wrap the Tokenizer with a TokenFilter that already has the possibility to delegate? If it is because you miss the reset(Reader) call, we could think about adding this to TokenFilter, that passes to the delegated Tokenizer (using instanceof checks).

        Show
        thetaphi Uwe Schindler added a comment - Normally it would be ok. E.g. in the reuse of TokenStreams, the simpliest would be to create the tokenizer with a null Reader first and only reset(Reader) it before first use. I think, this has historical reasons and to keep consistent we should add the ctors. Or deprecate all Reader ctors and state, that you should create a reusable Tokenizer and call reset(Reader). I am still not sure, why a simple TokenFilter does not serve the same pupose you would like to have with Tokenizer here. Why not simply wrap the Tokenizer with a TokenFilter that already has the possibility to delegate? If it is because you miss the reset(Reader) call, we could think about adding this to TokenFilter, that passes to the delegated Tokenizer (using instanceof checks).
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        But for completeness, this ctor should also get the Reader/CharStream (as all other ctors have the Reader param).

        Wouldn't tokenizer.reset(reader) serve the same purpose? I'm not sure why all those different constructors are there.

        Show
        yseeley@gmail.com Yonik Seeley added a comment - But for completeness, this ctor should also get the Reader/CharStream (as all other ctors have the Reader param). Wouldn't tokenizer.reset(reader) serve the same purpose? I'm not sure why all those different constructors are there.
        Hide
        thetaphi Uwe Schindler added a comment -

        OK, I was wondering, because TokenFilter is there for this pupose and TokenStream only provides the AttributeSource ctor because the TokenFilter subclass needs this. So one could also simply create a TokenFilter and put it ontop of the Tokenizer to wrap? new TokenFilter(new WrappedTokenizer()) - why need a Tokenizer for that when TokenFilter is made for it?

        But for completeness, this ctor should also get the Reader/CharStream (as all other ctors have the Reader param).

        Show
        thetaphi Uwe Schindler added a comment - OK, I was wondering, because TokenFilter is there for this pupose and TokenStream only provides the AttributeSource ctor because the TokenFilter subclass needs this. So one could also simply create a TokenFilter and put it ontop of the Tokenizer to wrap? new TokenFilter(new WrappedTokenizer()) - why need a Tokenizer for that when TokenFilter is made for it? But for completeness, this ctor should also get the Reader/CharStream (as all other ctors have the Reader param).
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        It makes delegation possible. Say one wanted to create a new Tokenizer by wrapping an existing Tokenizer or TokenStream.

        Show
        yseeley@gmail.com Yonik Seeley added a comment - It makes delegation possible. Say one wanted to create a new Tokenizer by wrapping an existing Tokenizer or TokenStream.
        Hide
        thetaphi Uwe Schindler added a comment -

        Why do you need this?

        Show
        thetaphi Uwe Schindler added a comment - Why do you need this?

          People

          • Assignee:
            Unassigned
            Reporter:
            yseeley@gmail.com Yonik Seeley
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development