Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None

      Description

Adds the possibility to set a TokenStream at Field construction time, available as tokenStreamValue in addition to stringValue, readerValue and binaryValue.

There might be some problems when mixing stored fields that have the same name as a field with a tokenStreamValue.
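The idea in the description can be sketched with a simplified, hypothetical stand-in for Lucene's Field (the real Fieldable API differs; an Iterator<String> plays the role of a TokenStream here). The point is only to show a tokenStreamValue slot sitting alongside the existing value kinds:

```java
import java.io.Reader;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for Lucene's Field -- illustration only.
// Exactly one of the four value slots is non-null.
class SimpleField {
    private final String name;
    private final String stringValue;                 // analyzed at index time
    private final Reader readerValue;                 // streamed, analyzed at index time
    private final byte[] binaryValue;                 // stored, never analyzed
    private final Iterator<String> tokenStreamValue;  // pre-analyzed tokens

    SimpleField(String name, String value) {
        this(name, value, null, null, null);
    }

    SimpleField(String name, Iterator<String> tokens) {
        this(name, null, null, null, tokens);
    }

    private SimpleField(String name, String s, Reader r, byte[] b, Iterator<String> t) {
        if (name == null) throw new NullPointerException("name cannot be null");
        this.name = name;
        this.stringValue = s;
        this.readerValue = r;
        this.binaryValue = b;
        this.tokenStreamValue = t;
    }

    String name() { return name; }
    String stringValue() { return stringValue; }
    Reader readerValue() { return readerValue; }
    byte[] binaryValue() { return binaryValue; }
    Iterator<String> tokenStreamValue() { return tokenStreamValue; }
}

public class FieldSketch {
    public static void main(String[] args) {
        // A plain field: the indexer would run an Analyzer over stringValue.
        SimpleField plain = new SimpleField("title", "hello world");

        // A pre-analyzed field: the caller supplies the tokens directly,
        // so the indexer skips analysis for this field.
        SimpleField pre = new SimpleField("body", List.of("hello", "world").iterator());

        System.out.println(plain.stringValue());           // hello world
        System.out.println(pre.tokenStreamValue().next()); // hello
    }
}
```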

      1. trunk.diff
        13 kB
        Karl Wettin
      2. preanalyze.tar
        10 kB
        Karl Wettin
      3. lucene-580.patch
        21 kB
        Michael Busch
      4. lucene-580.patch
        15 kB
        Karl Wettin

        Activity

        michaelbusch Michael Busch added a comment -

        I just committed this. Thanks a lot for your patches and the productive discussions, Karl!

        michaelbusch Michael Busch added a comment -

I hesitate to add even more constructors to Field, since it already has a bunch of them. And I don't see the value in the constructor you proposed, because the same can be achieved by using setOmitNorms(). I think it is a bit confusing to have the Index parameter with only two out of four allowed values in the constructor.

        karl.wettin Karl Wettin added a comment -

        How about two constructors?

        michaelbusch Michael Busch added a comment -

I'd like to keep the new constructor consistent with the constructor that takes a Reader argument. You can use Fieldable.setOmitNorms() to disable norms.

        karl.wettin Karl Wettin added a comment -

        A patch that supports omitting norms.

        /**
         * Create a tokenized and indexed field that is not stored, optionally with
         * storing term vectors.
         * @param name The name of the field
         * @param tokenStream The reader with the content
        + * @param index Must be tokenized or no norms.
         * @param termVector Whether term vector should be stored
         * @throws NullPointerException if name or reader is <code>null</code>
         */
        - public Field(String name, TokenStream tokenStream, TermVector termVector) {
        + public Field(String name, TokenStream tokenStream, Index index, TermVector termVector) {
            if (name == null)
              throw new NullPointerException("name cannot be null");
            if (tokenStream == null)
              throw new NullPointerException("tokenStream cannot be null");
        +   if (index != Index.TOKENIZED && index != Index.NO_NORMS) {
        +     throw new IllegalArgumentException("index must be either TOKENIZED or NO_NORMS");
        +   }

        Also fixed some copy/paste related javadoc errors.

        karl.wettin Karl Wettin added a comment -

        28 apr 2007 kl. 20.42 skrev Michael Busch (JIRA):
        > Please let me know what you think about this new patch.

        +1. Very clean.

        michaelbusch Michael Busch added a comment -

        Karl,

        thanks again for your suggestions. I created a patch with a slightly different approach compared to your latest patch. Similar to java.io.Reader, I added the public method reset() to TokenStream, which does nothing by default. Subclasses may or may not override this method. I also added the new class CachingTokenFilter to the analysis package, which does the same as your CachedPreAnalyzedField. Before the DocumentWriter consumes the TokenStream, it calls reset() to reposition the stream at the beginning.

        With this approach it is no longer necessary to introduce the TokenStreamFactory and PreAnalyzedField classes, which is simpler and more consistent with the Analyzer API in my opinion. Yet it still allows consuming the Tokens of a stream more than once, which should satisfy your needs?

        Please let me know what you think about this new patch. Maybe other committers could take a look as well, since this is an API change (well, extension) to two very common classes: TokenStream and Field.
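The reset()/CachingTokenFilter mechanism described in this comment can be sketched with simplified, hypothetical analogs (Lucene's real TokenStream returns Token objects, not Strings; the Mini* names here are stand-ins for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: simplified analogs of TokenStream and CachingTokenFilter.
abstract class MiniTokenStream {
    /** Returns the next token, or null when the stream is exhausted. */
    abstract String next();

    /** Repositions the stream at the beginning; a no-op by default. */
    void reset() {}
}

/** A one-shot stream over a fixed token list; does not override reset(). */
class ListTokenStream extends MiniTokenStream {
    private final List<String> tokens;
    private int pos = 0;

    ListTokenStream(List<String> tokens) { this.tokens = tokens; }

    @Override
    String next() { return pos < tokens.size() ? tokens.get(pos++) : null; }
}

/** Caches tokens from the wrapped stream on the first pass; after reset(),
 *  replays them from the cache instead of the (exhausted) input. */
class MiniCachingTokenFilter extends MiniTokenStream {
    private final MiniTokenStream input;
    private final List<String> cache = new ArrayList<>();
    private int replayPos = -1; // -1: still filling the cache

    MiniCachingTokenFilter(MiniTokenStream input) { this.input = input; }

    @Override
    String next() {
        if (replayPos >= 0) {
            return replayPos < cache.size() ? cache.get(replayPos++) : null;
        }
        String t = input.next();
        if (t != null) cache.add(t);
        return t;
    }

    @Override
    void reset() { replayPos = 0; }
}

public class CachingDemo {
    public static void main(String[] args) {
        MiniTokenStream ts =
            new MiniCachingTokenFilter(new ListTokenStream(List.of("a", "b")));
        // First pass fills the cache.
        while (ts.next() != null) { }
        // reset() lets a second consumer read the same tokens again.
        ts.reset();
        System.out.println(ts.next()); // a
        System.out.println(ts.next()); // b
    }
}
```

This is the design trade-off the comment argues for: instead of a factory that manufactures fresh streams, a single stream gains an optional reset() hook, and a caching filter makes any stream replayable.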

        karl.wettin Karl Wettin added a comment -

        25 apr 2007 kl. 10.23 skrev Michael Busch (JIRA):
        > What kind of use cases do you have in mind with the cached field?

        I made the initial implementation for text mining purposes – I needed the term vector prior to inserting the document into the index. Back then I analyzed, cached it up, did my secondary analysis of the vector, and finally reconstructed the token stream and passed it to the field. I think it would be easier to just pass a token stream to an extension of CachedPreAn.. that also features a termFreqVector(), termPosVector(), etc.

        Karl

        michaelbusch Michael Busch added a comment -

        Hi Karl,

        thank you for your comments and your new patch! I actually updated your initial patch a couple of days ago to apply cleanly on the current trunk, as I found it quite simple and straightforward.

        I have a question regarding the new patch: what kind of use cases do you have in mind with the cached field? I would think that a TokenStream will only be read once by the DocumentWriter? Currently you can add a field with a Reader, and as far as I know it is not guaranteed that a Reader supports reset(), which means it can only be read once, too. Thoughts?

        Michael

        karl.wettin Karl Wettin added a comment -

        An implementation suggestion. The project will not compile as there are two Fieldable implementations that do not implement the tokenStreamFactoryValue() method: LazyField and FieldForMerge. I suspect they could just return null values, but I leave it open as I'm not sure.

        Patch contains:

        public interface TokenStreamFactory {
        public abstract TokenStream factory() throws IOException;
        }

        /** Encapsulates a single instance of TokenStream that can be passed on ONCE. */
        public class SimplePreAnalyzedField implements TokenStreamFactory {

        /** Caches an instance of TokenStream in a List, reassembled to a TokenStream for each call to factory() */
        public class CachedPreAnalyzedField implements TokenStreamFactory {
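The one-shot versus cached factory semantics sketched in this comment could look roughly as follows, with an Iterator<String> standing in for TokenStream (the class names here are hypothetical simplifications, not the patch's actual code):

```java
import java.util.Iterator;
import java.util.List;

// Illustration only: the factory idea, with Iterator<String> in place of TokenStream.
interface TokenIteratorFactory {
    Iterator<String> factory();
}

/** Hands out its single underlying stream exactly once. */
class OneShotTokenFactory implements TokenIteratorFactory {
    private Iterator<String> stream;

    OneShotTokenFactory(Iterator<String> stream) { this.stream = stream; }

    @Override
    public Iterator<String> factory() {
        if (stream == null) throw new IllegalStateException("stream already handed out");
        Iterator<String> s = stream;
        stream = null;
        return s;
    }
}

/** Buffers the tokens and builds a fresh iterator on every call. */
class CachedTokenFactory implements TokenIteratorFactory {
    private final List<String> tokens;

    CachedTokenFactory(List<String> tokens) { this.tokens = tokens; }

    @Override
    public Iterator<String> factory() { return tokens.iterator(); }
}

public class FactoryDemo {
    public static void main(String[] args) {
        CachedTokenFactory cached = new CachedTokenFactory(List.of("x", "y"));
        System.out.println(cached.factory().next()); // x
        System.out.println(cached.factory().next()); // x  (fresh iterator each call)
    }
}
```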

        karl.wettin Karl Wettin added a comment -

        Perhaps it would be nice if standard term vectors, positions, etc. were available for easy extraction? Pass a TokenStream to the constructor of the pre-analyzed field and buffer it up in a LinkedHashMap, or something in that direction.

        karl.wettin Karl Wettin added a comment -

        Michael, let me know if you start working on this. I would not mind a discussion on how the token stream supplier should look. For one, it should be an interface/class that supplies the token stream, and not a TokenStream itself, as it can only be read once. Or?

        karl.wettin Karl Wettin added a comment -

        Nadav Har'El [18/Jan/07 08:21 AM]

        > The description above suggests that it might not work if the same
        > field name is used for two Field's, one stored and the other preanalyzed.
        > I think it is important that this combination (as well as all other
        > combinations) are supported. I actually use all these combinations in my
        > code, and I don't see why it should cause problems.

        Actually, I can't remember why I thought there could be a problem, nor can I think of one now. This code is from my pre-tests era, and it could use some.

        If you like this strategy, I think it would be more elegant passing a factory rather than the actual token stream.

        > The patch has some strange changes in the comments, changing the word
        > "Index" to "NotificationService". I bet this wasn't intentional

        Hehe, that must be an old refactoring search and replace incident.

        nyh Nadav Har'El added a comment -

        This patch will be useful for users of LUCENE-755, the payloads patch. That patch adds "payloads" to tokens, but using it to add a few tokens with payloads in some field can be ugly, because you need to split the code into two places: in one place you add the field (only text), and in another you write a special analyzer that works only on that field, recognizes the specific tokens, and adds the payloads to them. This patch makes that easier: when you add a field, you can add it pre-analyzed, already as a list of tokens, and those tokens will already have their special payloads in them.
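The payload scenario above can be sketched with a minimal token-with-payload model. This is illustration only: Lucene's real Token/Payload API differs, and LUCENE-755 defines the actual payload mechanics; PayloadToken is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustration only: a token that optionally carries a payload.
class PayloadToken {
    final String term;
    final byte[] payload; // null when the token carries no payload

    PayloadToken(String term, byte[] payload) {
        this.term = term;
        this.payload = payload;
    }
}

public class PayloadDemo {
    public static void main(String[] args) {
        // With a pre-analyzed field, the tokens -- payloads included -- are
        // built in one place at add time, instead of being reconstructed by
        // a special per-field Analyzer later.
        List<PayloadToken> bodyTokens = List.of(
            new PayloadToken("plain", null),
            new PayloadToken("boosted", "weight=2".getBytes(StandardCharsets.UTF_8)));

        // In the real API this list would be exposed as a TokenStream and
        // handed to the Field constructor.
        System.out.println(bodyTokens.get(1).term); // boosted
    }
}
```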

        I have just a few comments on this patch:

        1. The description above suggests that it might not work if the same field name is used for two Field's, one stored and the other preanalyzed. I think it is important that this combination (as well as all other combinations) are supported. I actually use all these combinations in my code, and I don't see why it should cause problems.

        2. The patch has some strange changes in the comments, changing the word "Index" to "NotificationService". I bet this wasn't intentional.

        3. The new Field constructor still has an "Index" parameter, taking TOKENIZED, UN_TOKENIZED or NO_NORMS (only NO is forbidden). I wonder, what's the difference between TOKENIZED and UN_TOKENIZED in this case? The NO_NORMS case is very useful, because it allows you to do something not previously possible in Lucene (a tokenized field, but without norms). Perhaps this parameter should be better documented in the javadoc comment.

        4. In the new Field constructor's comment, the phrase "if name or reader" should be "if name or tokenStream".

        Thanks!

        karl.wettin Karl Wettin added a comment -

        The description should be "This patch"... and not "This page"...

        karl.wettin Karl Wettin added a comment -

        Field.java.diff
        DocumentWriter.java.diff


          People

          • Assignee:
            michaelbusch Michael Busch
            Reporter:
            karl.wettin Karl Wettin
          • Votes:
            3
            Watchers:
            1
