Lucene - Core
LUCENE-2309

Fully decouple IndexWriter from analyzers

    Details

    • Lucene Fields:
      New

      Description

      IndexWriter only needs an AttributeSource to do indexing.

      Yet, today, it interacts with Field instances, holds a private
      analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
      wide variety of cases (the field is not analyzed; it is analyzed but its
      value is a Reader or String; it is pre-analyzed).

      I'd like to have IW only interact with attr sources that already
      arrived with the fields. This would be a powerful decoupling – it
      means others are free to make their own attr sources.

      They need not even use any of Lucene's analysis impls; e.g. they can
      integrate with other things like OpenPipeline,
      or make something completely custom.

      LUCENE-2302 is already a big step towards this: it makes IW agnostic
      about which attr is "the term", and only requires that it provide a
      BytesRef (for flex).

      Then I think LUCENE-2308 would get us most of the remaining way – i.e., if the
      FieldType knows the analyzer to use, then we could simply create a
      getAttrSource() method (say) on it and move all the logic IW has today
      into it. (We'd still need the existing IW code for back-compat.)
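
      A minimal sketch of what that getAttrSource() idea might look like, assuming a
      hypothetical FieldType that owns its Analyzer (class and method names here are
      illustrative only, not existing Lucene API):

      import java.io.IOException;
      import java.io.Reader;
      import java.io.StringReader;

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.TokenStream;

      // Hypothetical field type: it, not IndexWriter, decides how a value
      // becomes a stream of attributes.
      public class SketchFieldType {
        private final Analyzer analyzer;

        public SketchFieldType(Analyzer analyzer) {
          this.analyzer = analyzer;
        }

        // Roughly the getAttrSource() described above: handle the
        // pre-analyzed / Reader / String cases here instead of in IW.
        public TokenStream getAttrSource(String fieldName, Object value) throws IOException {
          if (value instanceof TokenStream) {      // pre-analyzed
            return (TokenStream) value;
          } else if (value instanceof Reader) {    // Reader value
            return analyzer.reusableTokenStream(fieldName, (Reader) value);
          } else {                                 // String value
            return analyzer.reusableTokenStream(fieldName, new StringReader(value.toString()));
          }
        }
      }

      With something like this, the indexer would only ever see the returned attribute source.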

      1. LUCENE-2309.patch
        14 kB
        Chris Male
      2. LUCENE-2309-analyzer-based.patch
        13 kB
        Chris Male
      3. LUCENE-2309-getTSFromField.patch
        19 kB
        Chris Male

        Activity

        Michael McCandless added a comment -

        We can't use attr source directly – we'd need to factor out
        the minimal API from TokenStream (.incrToken & .end?) and
        use that (thanks Robert!).
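
        A minimal sketch of the factored-out API being suggested here, assuming a
        hypothetical interface name; this is not existing Lucene API, just an
        illustration of "AttributeSource plus incrementToken()/end()":

        import java.io.IOException;

        import org.apache.lucene.util.Attribute;

        // Hypothetical minimal contract the indexer could consume instead of
        // a full TokenStream (no reset(), no close()).
        public interface MinimalTokenSource {
          // advance to the next token; returns false when exhausted
          boolean incrementToken() throws IOException;

          // called once after the last token, e.g. to report the final offset
          void end() throws IOException;

          // the indexer still needs access to the shared attribute instances
          <A extends Attribute> A getAttribute(Class<A> attClass);
        }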

        Shai Erera added a comment -

        Would this mean that we could then move all of the core Analyzers to contrib/analyzers, taking one step towards getting them completely out of Lucene and into their own Apache project?

        That way, we can keep in core only the AttributeSource and accompanying classes, and really allow people to pass an AttributeSource which is not even an Analyzer (like you said). We can move the specific Analyzer tests to contrib/analyzers as well. The other tests in core, which don't care about analysis, can use a src/test-specific AttributeSource, like TestAttributeSourceImpl ...

        I'm thinking - it's OK for contrib to depend on core, but not the other way around. It will, however, take out of core a useful feature that allows new users to bootstrap quickly. That won't be the case once analyzers move out of Lucene entirely, but while they are in Lucene, we'll force everyone to download contrib/analyzers as well. So maybe we keep in core only Standard, or maybe even something simpler, again for easy bootstrapping (like Whitespace + lowercase).

        This is just a thought.

        Michael McCandless added a comment -

        Would this mean that after that we can move all of core Analyzers to contrib/analyzers

        Yes, though, I think that's orthogonal (can and should be separately
        done, anyway).

        making one step towards getting them completely out of Lucene and into their own Apache project?

        We may simply "standardize" on contrib/analyzers as the one place,
        instead of a new [sub-]project. To be discussed... but we really do
        need one place.

        That way, we can keep in core only the AttributeSource and accompanying classes, and really allow people to pass AttributeSource which is not even an Analyzer (like you said). We can move the specific Analyzer tests to contrib/analyzers as well. The other tests in core, who don't care about analysis, can use a src/test specific AttributeSource, like TestAttributeSourceImpl ...

        Right.

        I'm thinking - it's ok for contrib to depend on core but not the other way around.

        I agree.

        It will however take out of core a useful feature for new users which allows fast bootstrap.

        Well... I suspect with this change users would not typically use
        lucene-core alone. I.e., they'd get analyzers and queryparser (if we
        also move it out as its own module).

        That won't be the case when analyzers move out of Lucene entirely, but while they are in Lucene, we'll force everyone to download contrib/analyzers as well.

        I think a single source for all analyzers will be a great step
        forward for users.

        So maybe we keep in core only Standard, or maybe even something simpler, again, for easy bootstrapping (like Whitespace + lowercase).

        Or remove them entirely (but, then, core tests will need to use
        contrib analyzers for their testing)...

        Robert Muir added a comment -

        Or remove them entirely (but, then, core tests will need to use
        contrib analyzers for their testing)...

        I agree, let's not get caught up on how our tests run from build.xml!
        We should decouple analysis from IW as much as possible, at least to support
        more flexible analysis: e.g. someone might not want to use the TokenStream
        concept at all.

        I don't really have a strong opinion on where all the analyzers should go, but I do agree
        it would be nice if they were in one place. For example, in contrib/analyzers we now
        have analyzers by language, and in most cases users should really be looking
        at EnglishAnalyzer as their "default" for English-language text instead of StandardAnalyzer,
        as it does Porter stemming, too.

        Shai Erera added a comment -

        Or remove them entirely (but, then, core tests will need to use contrib analyzers for their testing)...

        For that I proposed to have a default TestAttributeSourceImpl, which does whitespace tokenization or something. If other 'core' tests need something else, we can write specific AttributeSources for them. I hope we can avoid introducing any dependency of core on contrib.

        Robert Muir added a comment -

        For that I proposed to have a default TestAttributeSourceImpl

        We need a bit more than AttributeSource: if the text has
        more than one token, it must at least support incrementToken().

        We could try factoring out incrementToken() and end() from
        TokenStream to create a "more generic" interface, but really,
        there isn't much more to TokenStream (except close and reset).

        At the same time, while I really like the decorator API of
        TokenStream, it should be easier for someone to use a completely
        different API, perhaps one that feels less like you are writing
        a finite-state machine by hand (captureState/restoreState, etc.).

        Michael McCandless added a comment -

        Or remove them entirely (but, then, core tests will need to use contrib analyzers for their testing)...

        For that I proposed to have a default TestAttributeSourceImpl, which does whitespace tokenization or something. If other 'core' tests need something else, we can write specific AttributeSources for them. I hope we can avoid introducing any dependency of core on contrib.

        Only tests would rely on the analyzers module. I think that's OK? Core itself would have no dependency.

        Shai Erera added a comment -

        Today when I run "ant test-core", contrib is not built, and I like that. Also, "ant test-backwards" will be affected, I think ... If core does not depend on contrib, its tests shouldn't either. It's weird if they do.

        Simon Willnauer added a comment -

        The IndexWriter, or rather DocInverterPerField, is simply an attribute consumer. Neither of them needs to know about Analyzer or TokenStream at all, and neither needs the analyzer to iterate over tokens. The IndexWriter should instead implement an interface, or use a class, that is called for each successful "incrementToken()", no matter how that increment is implemented.

        I could imagine a really simple interface like

        
        interface AttributeConsumer {
          
          void setAttributeSource(AttributeSource src);
        
          void next();
        
          void end();
        
        }
        

        IW would then pass itself, or an instance it uses (DocInverterPerField), to an API expecting such a consumer, like:

        field.consume(this);
        

        or something similar. That way we have no dependency on whatever attribute producer is used. The default implementation would of course be based on an analyzer / token stream, and alternatives could be exposed via an expert API. Even backwards compatibility could be solved that way easily.

        Only tests would rely on the analyzers module. I think that's OK? core itself would have no dependence.

        +1, test dependencies should not block modularization; it's just about configuring the classpath!

        Michael McCandless added a comment -

        The IndexWriter or rather DocInverterPerField are simply an attribute consumer. None of them needs to know about Analyzer or TokenStream at all. Neither needs the analyzer to iterate over tokens.

        [Carrying over discussions on IRC with Chris Male & Uwe...]

        Actually, TokenStream is already AttrSource + incrementing, so it
        seems like the right start...

        However, the .reset() method is redundant from the indexer's standpoint –
        i.e., when the indexer calls Field.getTokenStream (say), whatever init'ing /
        reset'ing is needed should already have been done by that method by the time it
        returns the TokenStream.

        Also, .close and .end are redundant – seems like we should only have
        .end (few token streams do anything in .close...). But coalescing
        those two would be a good chunk of work at this point. Or maybe we
        add a .finish that simply calls both by default.

        Finally, the indexer doesn't really need a Document; it just needs
        something abstract that provides an iterator over all fields that
        need indexing (and, separately, storing).

        Simon Willnauer added a comment -

        [Carrying over discussions on IRC with Chris Male & Uwe...]

        That makes it very hard to participate. I cannot afford to read through all the IRC traffic, and I don't get the chance to participate directly unless I watch IRC constantly. We should really move back to JIRA / the dev list for such discussions. There is too much going on in IRC.

        Actually, TokenStream is already AttrSource + incrementing, so it
        seems like the right start...

        But that binds the indexer to a TokenStream, which is unnecessary IMO. What if I want to implement something outside the TokenStream delegator API?

        Robert Muir added a comment -

        Hello, I commented yesterday but did not receive much feedback, so
        I want to elaborate some more:

        I suppose what I was trying to mention in my earlier comment here:
        https://issues.apache.org/jira/browse/LUCENE-2309?focusedCommentId=12844189&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12844189

        is that while I really like the new TokenStream API, I would prefer
        it if we thought about making this flexible enough to support
        "different paradigms", including perhaps something that looks a lot
        like the old TokenStream API.

        The reason is, I notice a lot of existing code still under the old API,
        and I think that in some cases it's perhaps easier to work with, even
        if you aren't a new user. I definitely think the old API might have
        some advantages for newer users.

        I think it's useful to consider supporting such an API, perhaps as an extension
        in contrib/analyzers; even if it's not as fast or flexible as the new API,
        perhaps the tradeoff of speed and flexibility would be worth the ease of use
        for newer users.

        Uwe Schindler added a comment - edited

        I could imagine a really simple interface like

        During lunch an idea evolved:

        If you look at the current DocInverter code, it does not use a consumer-like API. The code just has an add/accept method that accepts tokens. The idea is, as Simon proposed, to let the DocInverter implement something like AttributeAcceptor. But we must still have the attribute API, and the acceptor (DocInverter) must always "see" the same attribute instances (otherwise much time would be spent calling getAttribute(...) for each token, if the accept method took an AttributeSource).

        The current TokenStream API could get a method taking an AttributeAcceptor and simply do a while-incrementToken() loop, calling accept() on the DocInverter (the AttributeAcceptor). Another approach for users would be to not use the TokenStream API at all and simply call the accept() method on the acceptor for each token.

        But both approaches still have the problem with the shared attributes. If you want to "record" tokens, you have to implement something like my proxy attributes. Otherwise (as mentioned above), most time would be spent in getAttribute() calls.
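
        A rough sketch of the acceptor idea described above, assuming hypothetical
        AttributeAcceptor / consume() names (neither exists in Lucene); the default
        pull-based path would just wrap the usual incrementToken() loop:

        import java.io.IOException;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.util.AttributeSource;

        public final class AcceptorSketch {

          // Hypothetical acceptor: it is handed the shared AttributeSource once,
          // then gets one accept() call per token.
          public interface AttributeAcceptor {
            void setAttributeSource(AttributeSource source);
            void accept() throws IOException;
            void end() throws IOException;
          }

          // What a consume(acceptor) method on TokenStream might do: the
          // while-incrementToken() loop moves here, out of DocInverter.
          public static void consume(TokenStream stream, AttributeAcceptor acceptor) throws IOException {
            acceptor.setAttributeSource(stream);   // a TokenStream is an AttributeSource
            stream.reset();
            while (stream.incrementToken()) {
              acceptor.accept();
            }
            stream.end();
            acceptor.end();
            stream.close();
          }
        }

        A push-style producer would instead populate the shared attributes itself and
        call accept() directly, without any TokenStream in the picture.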

        Michael McCandless added a comment -

        The idea is to, as Simon proposed, let the docinverter implement something like AttributeAcceptor.

        This is interesting! It inverts the stack/control flow but would continue to use shared attrs.

        So then somehow the indexer would pass its AttrAcceptor to the field? And the field would have whatever control logic it wants to feed the tokens...

        Michael McCandless added a comment -

        Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start...

        But that binds the Indexer to a tokenstream which is unnecessary IMO. What if I want to implement something aside the TokenStream delegator API?

        True, but we need at least some way to increment? AttrSource doesn't have that.

        But I don't think we need reset nor close from TokenStream.

        Maybe we could factor out an abstract class / interface that TokenStream implements, minus the reset & close methods?

        Then people could freely use Lucene to index off a foreign analysis chain...

        Shai Erera added a comment -

        We should really move back to JIRA / devlist for such discussions

        +1!! I also find it very hard to track so many sources of discussion (JIRA, java-dev, java-user, general, and now IRC). Also, IRC is not logged/archived or searchable (I think?), which makes it impossible to trace back a discussion and/or randomly stumble upon it in Google.

        I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens; only the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, where IW delegates itself to a Field so the Field calls back into IW, seem too complicated to me. Users that write Analyzers/TokenStreams/AttributeSources should not care how they are indexed/stored etc. Forcing them to implement this push logic towards IW seems to me like real unnecessary overhead and complexity.

        And having the Field control the flow of indexing also seems dangerous ... it might expose Lucene to lots of bugs from users. Today, when IW controls it, there's one place to look, but tomorrow, when Field controls it, where do we look? In the app's custom Field code? In IW's attribute-consuming methods?

        Will the Field also control how stored fields are added? Or only AttributeSourced ones?

        Maybe I need to get used to this change, but currently it looks wrong to reverse the control flow. Maybe in principle the DocInverter now accepts tokens from IW, and so it looks as if we can pass it to the Field (as IW's AttAcceptor), but still the concept is different. We (IW) control the indexing flow, and not the user.

        I also may not understand what that will give users. Shouldn't users get enough flexibility w/ the current API and the flex stuff (once it's out)? Do they really need to be bothered w/ pushing tokens to IW?

        Uwe Schindler added a comment -

        I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field will call back to IW seems too much complicated to me. Users that write Analyzers/TokenStreams/AttributeSources, should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like a real unnecessary overhead and complexity.

        The idea was not to change this behaviour, but also to give the user the possibility to reverse it. For some token streams it would simplify things a lot. The current IndexWriter code works exactly like this:

        1. DocInverter gets TokenStream
        2. DocInverter calls reset() – to be left out and moved to field/analyzer
        3. DocInverter does while-loop on incrementToken - it iterates. On each Token it calls add() on the field consumer
        4. DocInverter calls end() and updates end offset
        5. DocInverter calls close() – to be left out and moved to field/analyzer

        The change is simply that step (3) is removed from DocInverter, which then only provides the add() method for accepting tokens. The while loop is simply done in the current TokenStream/Field code, so nobody needs to change their code. But somebody who actively wants to push tokens can now do so. If they want to do that currently, they have no chance without heavy buffering.

        So the push API would be very expert-level, and the current TokenStream would just be one user of this API.

        Mark Miller added a comment -

        Also IRC is not logged/archived and searchable (I think?) which makes it impossible to trace back a discussion, and/or randomly stumble upon it in Google.

        Apache's rule is: if it didn't happen on the lists, it didn't happen. IRC is a great way for people to communicate and hash stuff out, but it's not necessary that you follow it. If you have questions or want further elaboration, just ask. No one can expect you to follow IRC, nor is it a valid reference for where something was decided. IRC is great - I think it has really benefited us to have devs discuss things there - but the official position is, if it didn't happen on the list, it didn't actually happen.

        Simon Willnauer added a comment -

        Then people could freely use Lucene to index off a foreign analysis chain...

        That is what I was talking about!

        I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field will call back to IW seems too much complicated to me. Users that write Analyzers/TokenStreams/AttributeSources, should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like a real unnecessary overhead and complexity.

        We can surely hide this implementation completely from Field. I consider this similar to Collector, where you pass it explicitly to the search method if you want different behavior. Maybe something like an AttributeProducer. I don't think adding this to Field makes a lot of sense at all, and it is not worth the complexity.

        Will the Field also control how stored fields are added? Or only AttributeSourced ones?

        IMO this is only about inverted fields.

        We (IW) control the indexing flow, and not the user.

        The user only gets the possibility to exchange the analysis chain, but not the control flow. The user can already mess around with stuff in incrementToken(); the only thing we change / invert is that the indexer does not know about TokenStreams anymore. It does not change the control flow, though.

        Uwe Schindler added a comment -

        There is one problem that cannot be easily solved (for all the proposals here) if we want to provide an old-style API that does not require reuse of tokens:
        The problem with AttributeProvider is that if we want to support something (like rmuir proposed before) that looks like the old "Token next()", we need an AttributeProvider that passes the AttributeSource to the indexer on each token! And that would lead to lots of getAttribute() calls, which would slow down indexing! So with the current APIs we cannot get around the requirement to reuse the same Attribute instances during the whole indexing process without a major speed impact. This can only be solved with my nice BCEL proxy Attributes, so you can exchange the inner attribute impl. Or do it like TokenWrapper in 2.9 (yes, we could reactivate that API somehow as an ease-of-use addendum).
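
        An illustration (not Lucene code) of why the indexer wants one fixed set of
        attribute instances: with a stable AttributeSource the attribute lookups happen
        once, outside the per-token loop; the consume() method below is only a stand-in
        for the real DocInverter consumer:

        import java.io.IOException;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

        public final class ReuseSketch {
          public static void index(TokenStream stream) throws IOException {
            // looked up once; valid for every token because the instances are reused
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);

            stream.reset();
            while (stream.incrementToken()) {
              // if a fresh AttributeSource arrived with each token, these lookups
              // would have to move inside the loop - the slowdown described above
              consume(term.toString(), posIncr.getPositionIncrement());
            }
            stream.end();
            stream.close();
          }

          private static void consume(String term, int positionIncrement) {
            // stand-in for handing the token to the inverter
          }
        }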

        Robert Muir added a comment -

        So with the current APIs we cannot get around the requirement to reuse the same Attribute instances during the whole indexing without a major speed impact.

        I agree. I guess I'll try to simplify my concern: maybe we don't necessarily
        need something that looks like the old TokenStream API, but I feel it would
        be worth our time to think about supporting 'some alternative API' that makes
        it easier to work with lots of context across different tokens.

        I personally do not mind how this is done with the capture/restore-state API,
        but I feel that it's pretty unnatural for many developers, and in the future folks
        might want to do more complex analysis (maybe even light POS tagging, etc.)
        that requires said context, and we should plan for this.

        I feel this wasn't such an issue with the old TokenStream API, but maybe there
        is another way to address this potential problem.

        Chris Male added a comment -

        I'd like to pick this issue up and run with it. Anyone have any new thoughts?

        Michael McCandless added a comment -

        Great!

        This will overlap w/ the field type work (we have a branch for this now), where we have already decoupled the indexer from concrete Field/Document impls by adding a minimal IndexableField.

        I think this issue should take that further, i.e. pare back IndexableField so that there's only a getTokenStream for indexing (i.e. the indexer will no longer try for String, then Reader, then tokenStream), and Analyzer must move to the FieldType and not be passed to IndexWriterConfig. Multi-valued fields will be tricky, since IW currently asks the analyzer for the gaps...
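
        A rough sketch of what such a pared-back field contract could look like; this is a
        simplification for illustration, not the actual IndexableField on the field type
        branch:

        import java.io.IOException;

        import org.apache.lucene.analysis.TokenStream;

        // The indexer only ever asks for a name and a TokenStream; the field (via
        // its FieldType/Analyzer) is responsible for producing a ready-to-use
        // stream from whatever value it holds.
        public interface MinimalIndexableField {
          String name();

          TokenStream tokenStream() throws IOException;
        }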

        Michael McCandless added a comment -

        Once Analyzer moves onto FieldTypes, I think we can deprecate/remove PerFieldAnalyzerWrapper.

        I'm not sure what to do about multi-valued fields; we may have to move the getPosIncr/OffsetGap onto IndexableField, since it's not easy to ask a TokenStream to do this shifting for us?

        Michael McCandless added a comment -

        Hmm, with this cutover comes a new spooky degree of freedom: the ability to specify a different analyzer for each value of a multi-value'd field.

        Maybe we need an explicit FieldType that is "multi-valued", and so the document would have only one instance of Field for that field name, and within that Field are multiple values? The problem is, this would lose a degree of freedom we have today, i.e. the ability for different values of the same field name to have wildly different types...

        Erick Erickson added a comment -

        <<<Hmm, with this cutover comes a new spooky degree of freedom: the ability to specify a different analyzer for each value of a multi-value'd field.>>>

        Call me a coward, but this scares me! Answering the question "why didn't I get the results I was expecting" would become...er...somewhat more difficult (I'm studying the understatement thing). Although I guess this wouldn't be accessible from Solr, so maybe it would be another one of those "expert" features?

        Mike Sokolov added a comment -

        Would there be any valid use for that "feature" that couldn't be accomplished in some more straightforward manner? It would be like a union, maybe: a field that is sometimes fish and other times fowl, or just a menagerie? But I can always create a group of fields of various types (in an application) that work together?

        Chris Male added a comment -

        Note, this patch is against the FieldType branch and is a very, very, VERY POC patch.

        • Adds the ability to set Analyzer on FieldType (I haven't removed it from IWC and haven't addressed the OffSetGap issue)
        • Adds AttributeConsumer abstraction which has callbacks like Simon described above.
        • Added consume(AttributeConsumer) to Field, AttributeSource and TokenStream. TokenStream.consume() handles calling reset, incrementToken, end and close. AttributeSource provides a simple implementation. Field.consume() instantiates a TokenStream from the Analyzer in the FieldType (or uses an existing one) and then passes on the consume call.
        • DocInverterPerField now uses AttributeConsumer when the Field is tokenized.
        • ReusableStringReader has been quickly made public so it can be used in Field
        • I changed one test (TestDemo) to have the Analyzer set on the FieldType.

        Very POC!

        Robert Muir added a comment -

        haven't addressed the OffSetGap issue

        I actually think these gaps are what we should address first. Here's a rough idea:

        • remove offset/position increment gap from Analyzer.
        • instead, for multi-valued fields, the field handles this internally: it returns a MultiValuedTokenStream (say) that does the concatenation/offset/position adjustment between values itself. IndexWriter just sees one token stream for the field and doesn't know about this; e.g. it just consumes positions and offsets.

        To do this, I think there could be problems if the analyzer does not reuse, as the indexer should see one set of attributes across the whole multi-valued field.

        So, to solve this problem first: I think we should remove Analyzer.tokenStream so all analyzers are reusable, and push ReusableAnalyzerBase's API down into Analyzer. We want to do this improvement anyway, to close that trap.
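
        A very rough sketch of the idea that the field side, not IndexWriter, applies the
        gaps between the values of a multi-valued field. It assumes a reusable Analyzer
        (reusableTokenStream), and the names here (consumeValues, emit) are illustrative
        only:

        import java.io.IOException;
        import java.io.StringReader;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
        import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

        public final class MultiValuedSketch {

          // Drive analysis of several values as if they were one stream, bumping the
          // first position increment of each later value and shifting its offsets.
          public static void consumeValues(Analyzer analyzer, String field, String[] values,
                                           int posIncrGap, int offsetGap) throws IOException {
            int offsetBase = 0;
            boolean firstValue = true;

            for (String value : values) {
              TokenStream stream = analyzer.reusableTokenStream(field, new StringReader(value));
              OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
              PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);

              stream.reset();
              boolean firstToken = true;
              while (stream.incrementToken()) {
                int incr = posIncr.getPositionIncrement();
                if (firstToken && !firstValue) {
                  incr += posIncrGap;   // the gap IndexWriter applies between values today
                }
                firstToken = false;
                emit(incr, offsetBase + offset.startOffset(), offsetBase + offset.endOffset());
              }
              stream.end();   // end() exposes the final offset of this value
              offsetBase += offset.endOffset() + offsetGap;
              stream.close();
              firstValue = false;
            }
          }

          private static void emit(int positionIncrement, int startOffset, int endOffset) {
            // stand-in for handing the adjusted token to the inverter
          }
        }

        A MultiValuedTokenStream would wrap the same bookkeeping behind a normal
        incrementToken() loop so IndexWriter sees just one stream.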

        Chris Male added a comment -

        While I digest your suggestion on handling the gap values,

        so first to solve this problem: I think first we should remove Analyzer.tokenStream so all analyzers are reusable, and push ReusableAnalyzerBase's API down into Analyzer. We want to do this improvement anyway to solve that trap.

        Beyond the heavy lifting of doing this (which I'm fine with doing), do you know offhand whether this is going to be a problem for any of the Analyzer/Tokenizer/TokenFilter impls we have? I seem to recall an issue where you looked into this.

        Robert Muir added a comment -

        Yes, so my idea is basically that Analyzer gets ReusableAnalyzerBase's API completely, so its reusableTokenStream() is final.

        In order for this to work, we need to first fix the limitation of ReusableAnalyzerBase:

         * ReusableAnalyzerBase is a simplification of Analyzer that supports easy reuse
         * for the most common use-cases. Analyzers such as
         * {@link PerFieldAnalyzerWrapper} that behave differently depending upon the
         * field name need to subclass Analyzer directly instead.
        

        This looks easy to do: the current implementation always puts a TokenStreamComponents into the 'previousTokenStream'. Instead, ReusableAnalyzerBase could have an optional ctor taking something like a ReuseStrategy, of which there are two:

        • GLOBAL: this is what it does now, reuse across all fields, and should be the default
        • PERFIELD: this one instead puts a Map<String,TokenStreamComponents> into the previous tokenstream and reuses on a per-field basis.

        Then the problem is solved, and we could name this thing Analyzer, make all analyzers extend it directly, and remove the non-reusable tokenStream(); it would only have a final reusableTokenStream() for consumers.
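
        A minimal sketch of the two reuse strategies described above; the names and the
        generic stand-in for TokenStreamComponents are illustrative, and a real analyzer
        would also need to keep this cached state per thread (e.g. in a ThreadLocal),
        which the sketch omits:

        import java.util.HashMap;
        import java.util.Map;

        public final class ReuseStrategySketch {

          public interface ReuseStrategy<T> {
            T getReusableComponents(String fieldName);
            void setReusableComponents(String fieldName, T components);
          }

          // GLOBAL: one cached set of components shared by all fields
          public static final class GlobalReuseStrategy<T> implements ReuseStrategy<T> {
            private T components;

            public T getReusableComponents(String fieldName) { return components; }
            public void setReusableComponents(String fieldName, T value) { components = value; }
          }

          // PERFIELD: one cached set of components per field name
          public static final class PerFieldReuseStrategy<T> implements ReuseStrategy<T> {
            private final Map<String, T> components = new HashMap<String, T>();

            public T getReusableComponents(String fieldName) { return components.get(fieldName); }
            public void setReusableComponents(String fieldName, T value) { components.put(fieldName, value); }
          }
        }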

        Chris Male added a comment -

        Fortunately, the patch I put up means PerFieldAnalyzerWrapper can be removed (since Analyzer becomes per FieldType). But I take your point and it will be necessary to implement what you describe to support SolrAnalyzer.

        Robert Muir added a comment -

        Fortunately, the patch I put up means PerFieldAnalyzerWrapper can be removed (since Analyzer becomes per FieldType).

        How does this work with QueryParser and other consumers of Analyzer though?

        Chris Male added a comment -

        Good point. I guess it will continue to have a place in those instances, yes.

        Robert Muir added a comment -

        I'm just asking the question because maybe I have a different idea of "fully decouple IndexWriter from Analyzer".

        If we set up an API where it's different at index time versus query time, I'm not going to be happy, because this is setting up users to fail: in general, I should be able to pass my 'analysis configuration' to all the various consumers, and if it's the same at index time and query time, things should work.

        I don't think we should expose a different per-field configuration at index time versus query time versus this or that; I think that's a step backwards.

        Chris Male added a comment -

        You make good points and I don't have answers for you at this stage.

        What I'm exploring with this direction currently is how best to consume the terms of a Field while minimizing the exposure of Analyzer / TokenStream in the indexing process. What has felt natural to me is having Analyzer at the Field level. This is already kind of implemented anyway - if a Field returns a tokenStreamValue() then thats used for indexing, no matter what the 'analysis configuration' is.

        Do you have any suggestions for other directions to follow?

        Robert Muir added a comment -

        What I'm exploring with this direction currently is how best to consume the terms of a Field while minimizing the exposure of Analyzer / TokenStream in the indexing process.

        I thought about this a lot, I'm not sure we need to minimize TokenStream, I think it might be fine! There is really nothing much more minimal than this, its just attributesource + incrementToken() + reset() + end() + close()...
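
        (For illustration, the whole consumer contract in one minimal sketch; the field name, text and analyzer here are arbitrary, not anything from the patch:)

        import java.io.StringReader;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        public class ConsumeTokens {
          // Iterate every token an analyzer produces for one field's text.
          static void consume(Analyzer analyzer, String field, String text) throws Exception {
            TokenStream ts = analyzer.reusableTokenStream(field, new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
              System.out.println(term.toString()); // or read whatever attributes you registered
            }
            ts.end();   // final offset/position bookkeeping
            ts.close();
          }
        }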

        This issue is a little out of date, I actually think we have been decoupling indexwriter from analysis even more... for example the indexwriter just pulls bytes from the tokenstream, etc.

        I think we have been going in the right direction and should just be factoring out the things that don't belong in the indexer or in analyzer (like this gap manipulation)

        What has felt natural to me is having Analyzer at the Field level. This is already kind of implemented anyway - if a Field returns a tokenStreamValue() then thats used for indexing, no matter what the 'analysis configuration' is.

        Yeah but thats currently an expert option, not the normal case. In general if we are saying IndexWriter doesn't take Analyzer but has some schema that is a list of Fields, then QueryParser etc need to take this list of fields also, so that queryparsing is consistent with analysis. But i'm not sure I like this either: I think it only couples QueryParser with IndexWriter.

        Do you have any suggestions for other directions to follow?

        I don't think we should try to do a huge mega-change right now given that there are various "problems" we should fix... some of these I know are my pet peeves, but:

        • fixing the Analyzers to only be reusable is important, its a performance trap. we should do this regardless... and we can even backport this one easily to 3.x (backporting improvements to reusableAnalyzerBase, deprecating tokenStream(), etc)
        • removing the positionIncrement/offsetGaps from Analyzer makes total sense to me, this is "decoupling indexwriter from analyzer" because these gaps make no sense to other analyzer consumers. So I think these gaps are in the wrong place, and should instead be in Field or whatever, which creates ConcatenatedTokenStream behind the scenes to provide to IndexWriter. This is also good too, maybe you need to do other things than munge offsets/positions across these gaps and this concatenation would no longer be hardcoded in indexwriter.

        I've looked at doing e.g. the first of these and I know its a huge mondo pain in the ass, these "smaller" changes are really hard and a ton of work.
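
        (For the second one, a very rough sketch of how the gap handling could move out of the indexer; ConcatenatedTokenStream is a hypothetical name from this discussion, not a class in the patch, and it assumes the sub-streams are reused so they share one AttributeSource:)

        import java.io.IOException;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

        // Hypothetical: concatenates per-value streams and applies the position
        // increment gap between them, outside of the indexer. Offset gaps and
        // error handling are omitted to keep the sketch short.
        class ConcatenatedTokenStream extends TokenStream {
          private final TokenStream[] streams;
          private final int posIncGap;
          private final PositionIncrementAttribute posIncAtt;
          private int current = 0;
          private boolean pendingGap = false;

          ConcatenatedTokenStream(int posIncGap, TokenStream... streams) {
            super(streams[0]); // assumes all streams share this AttributeSource (reuse enforced)
            this.streams = streams;
            this.posIncGap = posIncGap;
            this.posIncAtt = addAttribute(PositionIncrementAttribute.class);
          }

          @Override
          public boolean incrementToken() throws IOException {
            while (current < streams.length) {
              if (streams[current].incrementToken()) {
                if (pendingGap) {
                  posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + posIncGap);
                  pendingGap = false;
                }
                return true;
              }
              streams[current].end();
              current++;
              pendingGap = true;
            }
            return false;
          }

          @Override
          public void reset() throws IOException {
            for (TokenStream s : streams) {
              s.reset();
            }
            current = 0;
            pendingGap = false;
          }
        }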

        Uwe Schindler added a comment -

        I like the idea here, that is "hey analyzer, give me your tokens!". For lots of use-cases this is much easier to implement than TokenStreams. Its just a change from pull to more push tokens. The impl in TokenStream is just consuming itself and pushing the tokens. From the abstraction point of view thats much easier to understand.

        Robert Muir added a comment -

        And that might be a good way, my point is if this is how we want to go, then Analyzer should instead provide AttributeConsumer, and the queryparsers etc should consume it with the same API (and tokenstream is an implementation detail behind the scenes).

        But i don't think queryparser should take PerFieldAnalyzerWrapper while IndexWriter takes per-Field analyzers, I think thats confusing.

        Uwe Schindler added a comment -

        I think Chris only started with the indexer as an example to show that it works. Of course we can rewrite all other consumers to use this new API. Also BaseTSTestCase.

        Robert Muir added a comment -

        well if thats the direction here, then we should describe the jira issue differently: something like "abstract away TokenStream API".

        because it just looks to me as if IndexWriter works off a different analysis API than other analysis consumers and I don't like that.

        Chris Male added a comment -

        I just want to re-iterate the point that I'm just exploring options here. I'm really glad to be getting feedback but no direction has been set in stone.

        I thought about this a lot, I'm not sure we need to minimize TokenStream, I think it might be fine! There is really nothing much more minimal than this, its just attributesource + incrementToken() + reset() + end() + close()...

        Okay interesting. I admit I don't really have any concrete thoughts on an 'alternative' to TokenStream, but wouldn't you agree that exposing a more limited API to the Indexer is beneficial here? I agree we're only talking about a handful of methods currently.

        Yeah but thats currently an expert option, not the normal case. In general if we are saying IndexWriter doesn't take Analyzer but has some schema that is a list of Fields, then QueryParser etc need to take this list of fields also, so that queryparsing is consistent with analysis. But i'm not sure I like this either: I think it only couples QueryParser with IndexWriter.

        Two things that jump out at me here. Firstly, IndexWriter does take a list of Fields, the Fields it indexes in each Document. We've also identified that people might want to apply different analysis per field, thats why we have PerFieldAnalyzerWrapper isn't it?

        Secondly, we now have multiple QueryParsing implementations consolidated together. I hope over time we'll add more / different implementations which do things differently. So while I totally agree with the sentiment that we shouldn't make this confusing for users and that searching and indexing should work together, I'm certainly open to exploring other ways to do that.

        I've looked at doing e.g. the first of these and I know its a huge mondo pain in the ass, these "smaller" changes are really hard and a ton of work.

        Independent of what's decided here, I'm definitely happy to do the heavy lifting on the improvements you've suggested.

        Chris Male added a comment -

        But i don't think queryparser should take PerFieldAnalyzerWrapper while IndexWriter takes per-Field analyzers, I think thats confusing.

        Yes it is.

        I think Chris only started with the indexer as an example to show that it works. Of course we can rewrite all other consumers to use this new API. Also BaseTSTestCase.

        Absolutely.

        well if thats the direction here, then we should describe the jira issue differently: something like "abstract away TokenStream API".

        I don't think what I've implemented in the patch is so different to what has been discussed in this issue earlier. I did consider opening another issue, but I thought this JIRA issue captured the conceptual issue quite well.

        because it just looks to me as if IndexWriter works off a different analysis API than other analysis consumers and I don't like that.

        I'm happy to explore those other consumers and strive to provide a user-friendly API to limit bugs. But I'm not getting the impression you like the concept at all.

        Robert Muir added a comment -

        I'm happy to explore those other consumers and strive to provide a user-friendly API to limit bugs. But I'm not getting the impression you like the concept at all.

        That's totally not it, but I do like the fact of having an "Analyzer" that is the central API for analysis that anything uses: IndexWriter, QueryParser, MoreLikeThis, Synonyms parsing, SpellChecking, or wherever we need it...

        I think if we want this to take AttributesConsumer or whatever, then thats cool, Analyzer returns this instead of TokenStream and we fix all these consumers to consume the more general API.

        I just want to make sure, that all consumers, not just IndexWriter, use the consistent API. This way, like today, someone declares FooAnalyzer, uses it everywhere, and stuff is consistent everywhere.

        Chris Male added a comment - - edited

        I think if we want this to take AttributesConsumer or whatever, then thats cool, Analyzer returns this instead of TokenStream and we fix all these consumers to consume the more general API.

        I just don't see how this would work. As it is in the patch, AttributeConsumer is a callback mechanism where the consumer provides their logic. Its nothing to do with Analyzers really and will be implemented differently depending on what the consumer wants to do in that instance.

        I just want to make sure, that all consumers, not just IndexWriter, use the consistent API. This way, like today, someone declares FooAnalyzer, uses it everywhere, and stuff is consistent everywhere.

        Absolutely desirable. AttributeConsumer isn't changing the Analyzer concept, its just changing how we consume from Analyzer. With that in mind, I very much agree with your assertion that this shouldn't change the Analyzer used in search and indexing. Whats prompted that concern here is the shift to per Field Analyzer. I'll reassess that change while waiting for other feedback.

        Robert Muir added a comment -

        I just don't see how this would work.

        As long as it implements an interface that provides:
        void consume(AttributeConsumer attributeConsumer)

        ?

        then, this could be what Analyzer returns, Consumable or whatever you want to call this
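
        (Purely illustrative: neither interface exists in Lucene, and the method names on AttributeConsumer are guesses at what such a callback would need:)

        import org.apache.lucene.util.AttributeSource;

        // Hypothetical push-style API from this discussion, not a shipped Lucene API.
        interface AttributeConsumer {
          void start(AttributeSource source); // grab attribute instances up front
          void consumeToken();                // called once per token
          void end();                         // final offset/position bookkeeping
        }

        // What Analyzer could return instead of a TokenStream: something that can
        // drive an AttributeConsumer over the tokens of one field.
        interface Consumable {
          void consume(AttributeConsumer attributeConsumer);
        }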

        Uwe Schindler added a comment -

        To come back to decoupling:
        With the new API, we no longer need NumericTokenStream, as NumericField can simply push the tokens to DocInverter. So TokenStream can move out of core, but NumericField & NumericRangeQuery can stay - nice!

        Chris Male added a comment -

        Okay, I thought about this overnight and have tried to come up with a middle ground. Again, very proof-of-concept.

        • Analyzer now moves away from exposing TokenStream (although I've left the methods there) and now returns an AttributeSource.
        • Field.consume() now becomes Field.consume(AttributeConsumer, Analyzer). Here, the Analyzer is the one passed into IW. This means that the Field can decide how it wants to expose its terms. The default implementation uses the Analyzer, but others can do what they like.
        • I've removed adding Analyzer to FieldType, but it could still be exposed as an expert option.

        The overall idea is that the Fields now control how terms are given to DocInverter.
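
        (As a sketch only, with method and interface names following this proof-of-concept rather than any released API, the default Field.consume() could just bridge to the Analyzer's TokenStream and drive the AttributeConsumer outlined above:)

        import java.io.IOException;
        import java.io.StringReader;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;

        // Hypothetical default implementation inside Field: the Field decides how
        // its terms reach DocInverter; by default it analyzes its own string value.
        public void consume(AttributeConsumer consumer, Analyzer analyzer) throws IOException {
          TokenStream ts = analyzer.reusableTokenStream(name(), new StringReader(stringValue()));
          consumer.start(ts);       // hand the attribute instances to the consumer
          ts.reset();
          while (ts.incrementToken()) {
            consumer.consumeToken();
          }
          ts.end();
          consumer.end();
          ts.close();
        }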

        Michael McCandless added a comment -

        I think there are two rather separate ideas here?

        First, IW should not have to "know" how to get a TokenStream from a
        IndexableField; it should only ask the Field for the token stream and get that
        back and iterate its tokens.

        Under the hood (in the IndexableField impl) is where the logic for
        tokenized or not, Reader vs String vs pre-created token stream,
        etc. should live, instead of hardwired inside indexer. Maybe an app
        has a fully custom way to make a token stream for the field...

        Likewise, for multi-valued fields, IW shouldn't "see" the separate
        values; it should just receive a single token stream, and under the
        hood (in Document/Field impl) it's concatenating separate token
        streams, adding posIncr/offset gaps, etc. This too is now hardwired
        in indexer but shouldn't be. Maybe an app wants to insert custom
        "separator" tokens between the values...

        (And I agree: as a pre-req we need to fix Analyzer to not allow
        non-reused token streams; else we can't concatenate w/o attr
        proxying/copying).

        If IW still receives analyzer and simply passes it through when asking
        for the tokenStream I think that's fine for now. In the future, I
        think IW should not receive analyzer (ie, it should be agnostic to how
        the app creates token streams); rather, each FieldType would hold the
        analyzer for that field. However, that sounds contentious, so let's
        leave it for another day.

        Second, this new idea to "invert" TokenStream into an AttrConsumer,
        which I think is separate? I'm actually not sure I like such an
        approach... it seems more confusing for simple usage? Ie, if I want
        to analyze some text and iterate over the tokens... suddenly, instead
        of a few lines of local code, I have to make a class instance with a
        method that receives each token? It seems more convoluted? I
        mean, for Lucene's limited internal usage of token stream, this is
        fine, but for others who consume token streams... it seems more
        cumbersome.

        Anyway, I think we should open a separate issue for "invert
        TokenStream into AttrConsumer"?

        Chris Male added a comment -

        I've thought about this issue some more and I feel there's a middle ground to be had.

        First, IW should not have to "know" how to get a TokenStream from a
        IndexableField; it should only ask the Field for the token stream and get that
        back and iterate its tokens.

        You're absolutely right and this should be our first step. It should be up to the Field to produce its terms, IW should just iterate through them.

        Likewise, for multi-valued fields, IW shouldn't "see" the separate
        values; it should just receive a single token stream, and under the
        hood (in Document/Field impl) it's concatenating separate token
        streams, adding posIncr/offset gaps, etc. This too is now hardwired
        in indexer but shouldn't be. Maybe an app wants to insert custom
        "separator" tokens between the values...

        I also totally agree. We should strive to reduce the hardwiring as much as possible and make it as flexible as possible. But again I see this as a step in the process.

        Second, this new idea to "invert" TokenStream into an AttrConsumer,
        which I think is separate? I'm actually not sure I like such an
        approach... it seems more confusing for simple usage? Ie, if I want
        to analyze some text and iterate over the tokens... suddenly, instead
        of a few lines of local code, I have to make a class instance with a
        method that receives each token? It seems more convoluted? I
        mean, for Lucene's limited internal usage of token stream, this is
        fine, but for others who consume token streams... it seems more
        cumbersome.

        I don't agree that this is separate. For me the purpose of this issue is to fully decouple IndexWriter from analyzers. As such, how IW consumes the terms it indexes is at the heart of the issue. The inversion approach is a suggestion for how we might tackle this in a flexible and extensible way. So I don't see any reason to push it to another issue. Its a way of fulfilling this issue.

        I think there is also some confusion here. I'm not suggesting we change all usage of analysis. If someone wants to consume TokenStream as is, so be it. What I'm looking at changing here is how IW gets the terms it indexes, thats all. We've introduced abstractions like IndexableField to be flexible and extensible. I don't think there's anything wrong with examining the same thing with TokenStream here.

        I think Robert has stated here that he's comfortable continuing to use TokenStream as the API for IW to get the terms it indexes, is that what others feel too? I agree the inverted API I proposed is a little convoluted and I'm sure we can come up with a simple Consumable like abstraction (which Robert did also suggest above). But if people are content with TokenStream then theres no need.

        Uwe Schindler added a comment -

        I think Robert has stated here that he's comfortable continuing to use TokenStream as the API for IW to get the terms it indexes, is that what others feel too? I agree the inverted API I proposed is a little convoluted and I'm sure we can come up with a simple Consumable like abstraction (which Robert did also suggest above). But if people are content with TokenStream then theres no need.

        I feel the same. The API of TokenStream is so stupid-simple, why replace it by another push-like API that is not simpler nor more complicated, just different? I see no reason in this. IW should simply request a TokenStream from the field and consume it.

        Likewise, for multi-valued fields, IW shouldn't "see" the separate
        values; it should just receive a single token stream, and under the
        hood (in Document/Field impl) it's concatenating separate token
        streams, adding posIncr/offset gaps, etc. This too is now hardwired
        in indexer but shouldn't be. Maybe an app wants to insert custom
        "separator" tokens between the values...

        I agree with that, too. There is one problem with this: Concatenating TokenStreams is not easy to do, as they have different attribute instances, so IW getting all attributes at the start would then somehow in the middle of the TS have to change the attributes.

        To implement this fast (without wrapping and copying), we need some notification that the consumer of a TokenStream needs to "request" the attribute instances again, but this is a "bad" idea. For me the only simple solution to this problem is to make the Field return an iterator of TokenStreams and have IW consume them one after another, doing the addAttribute before each separate instance.

        About the PosIncr Gap: The field can change the final offsets/posIncr in end() before handing over to a new TokenStream. IW would only consume TokenStreams one by one.
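
        (A hypothetical indexer-side loop for that idea; the Iterator<TokenStream> would come from whatever the multi-valued Field exposes, nothing here is existing API:)

        import java.io.IOException;
        import java.util.Iterator;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        // Hypothetical: consume one TokenStream per field value, re-fetching the
        // attribute instances before each stream instead of once up front.
        static void invertAll(Iterator<TokenStream> streams) throws IOException {
          while (streams.hasNext()) {
            TokenStream ts = streams.next();
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class); // per-stream instances
            ts.reset();
            while (ts.incrementToken()) {
              // invert term / position / offset for this token
            }
            ts.end();   // the Field can fold posIncr/offset gaps into end() here
            ts.close();
          }
        }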

        Chris Male added a comment -

        I feel the same. The API of TokenStream is so stupid-simple, why replace it by another push-like API that is not simpler nor more complicated, just different? I see no reason in this. IW should simply request a TokenStream from the field and consume it.

        But you do favour a pull-like API as an alternative?

        Uwe Schindler added a comment -

        But you do favour a pull-like API as an alternative?

        TokenStream is pull and I do favour this one.

        Chris Male added a comment -

        Err yes sorry you're right.

        Robert Muir added a comment -

        I agree with that, too. There is one problem with this: Concatenating TokenStreams is not easy to do, as they have different attribute instances, so IW getting all attributes at the start would then somehow in the middle of the TS have to change the attributes.

        I don't think the attributes should be allowed to change here. This is why above i already said, we should enforce reusability. Then there is no problem.

        Chris Male added a comment -

        Getting back on this after the Analyzer work.

        New patch is far more traditional and adds tokenStream(Analyzer) to IndexableField. This replaces tokenStreamValue(). Consumers wishing to index a field now call tokenStream(Analyzer) which is responsible to create the appropriate TokenStream for the field.
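
        (Roughly, the dispatch that moves out of DocInverterPerField and into Field looks like the following; this is a simplified sketch, SingleTokenStream stands in for however un-tokenized values get wrapped, and the exact field accessors may differ from the patch:)

        import java.io.IOException;
        import java.io.StringReader;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;

        // Simplified sketch of Field.tokenStream(Analyzer); numeric fields and
        // error checks are omitted.
        public TokenStream tokenStream(Analyzer analyzer) throws IOException {
          if (!fieldType().indexed()) {
            return null;                                   // nothing to invert
          }
          if (!fieldType().tokenized()) {
            return new SingleTokenStream(stringValue());   // hypothetical single-token wrapper
          }
          if (tokenStream != null) {
            return tokenStream;                            // app supplied a pre-analyzed stream
          }
          if (readerValue() != null) {
            return analyzer.reusableTokenStream(name(), readerValue());
          }
          return analyzer.reusableTokenStream(name(), new StringReader(stringValue()));
        }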

        Uwe Schindler added a comment -

        Looks much more straightforward now. I like this implementation.

        Simon Willnauer added a comment -

        Looks much more straightforward now. I like this implementation.

        +1 looks good though. much simpler too!

        Michael McCandless added a comment -

        This looks great! I love all the -'d code from DocInverterPerField

        In Field.java do we already check that if the field is not tokenized then it has a non-null stringValue()?

        I would like to for IW to not have to pass through the Analyzer here (ie FieldType should know the Analyzer for that field), but let's save that for another issue/time.

        Likewise, multi-valued field should ideally be "under the hood" from IW's standpoint, ie we should have a MultiValuedField and you append to a List inside it, and then IW gets a single TokenStream from that, which does its own concatenating of the separate TokenStreams, but we should tackle that under a separate issue.

        Chris Male added a comment -

        In Field.java do we already check that if the field is not tokenized then it has a non-null stringValue()?

        I don't think we do. Its always been implied (which could cause a bug). I'll add the appropriate checks but we really need to revisit the constructors of Field at some stage.
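
        (Illustrative only, not the committed code, and assuming the trunk FieldType API at the time:)

        import org.apache.lucene.document.Field;

        // A non-tokenized, indexed field has nothing to index unless it carries a string value.
        static void checkUntokenized(Field f) {
          if (f.fieldType().indexed() && !f.fieldType().tokenized() && f.stringValue() == null) {
            throw new IllegalArgumentException(
                "Non-tokenized field \"" + f.name() + "\" must have a non-null stringValue()");
          }
        }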

        I would like to for IW to not have to pass through the Analyzer here (ie FieldType should know the Analyzer for that field), but let's save that for another issue/time.

        I totally agree. Theoretically FieldType could have Analyzer added to it now and it could make use of it. But removing the Analyzer from IW seems controversial, alas

        Likewise, multi-valued field should ideally be "under the hood" from IW's standpoint, ie we should have a MultiValuedField and you append to a List inside it, and then IW gets a single TokenStream from that, which does its own concatenating of the separate TokenStreams, but we should tackle that under a separate issue.

        It's nearly possible. We're almost there on the reusable Analyzers. This can already begin actually for non-tokenized fields and for NumericFields.

        I'll make the non-null StringValue checks and then commit.

        Chris Male added a comment -

        Committed revision 1174506.

        I think this issue is wrapped and we can spin the other improvements off?

        Michael McCandless added a comment -

        Yeah I think we are done here! Nice work.

        Chris Male added a comment -

        Change has been committed. We'll spin the multiValued fields work off as a separate issue.


          People

          • Assignee:
            Chris Male
          • Reporter:
            Michael McCandless
          • Votes:
            1
          • Watchers:
            3
