SOLR-4619: Improve PreAnalyzedField query analysis

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0, 4.1, 4.2, 4.2.1, 6.0
    • Fix Version/s: 5.5, 6.0
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      PreAnalyzedField extends plain FieldType and mistakenly uses the DefaultAnalyzer as its query analyzer, and doesn't allow for customization via <analyzer> schema elements.

      Instead it should extend TextField and support all query analysis supported by that type.
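
      For illustration, a minimal sketch of the proposed change (hypothetical class name, not the actual patch), assuming the Solr 4.x schema API:

          import org.apache.solr.schema.TextField;

          // Sketch: extending TextField instead of FieldType means the type
          // inherits TextField's analyzer plumbing, so <analyzer type="query">
          // elements in schema.xml are honored at query time.
          public class SketchPreAnalyzedField extends TextField {
            // Index-time handling of the pre-analyzed serialization would still
            // be done by a pluggable parser, as PreAnalyzedField does today.
          }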

      Attachments

      1. SOLR-4619.patch (21 kB, Steve Rowe)
      2. SOLR-4619.patch (20 kB, Steve Rowe)
      3. SOLR-4619.patch (10 kB, Steve Rowe)
      4. SOLR-4619.patch (3 kB, Andrzej Bialecki)


          Activity

          Andrzej Bialecki added a comment -

          Comments by John Berryman, copied from SOLR-1535:

          Ah, I see. This is a bit lower level than I was thinking. Still useful, but different. I was thinking about having PreAnalyzedField extend directly from TextField rather than from FieldType, and then be able to build up whatever analysis chain that you want in the usual TextField sense. Query analysis would proceed as with a normal TextField, but index analysis would smart detect whether this input was already parsed or not. If the input was not parsed, then it would go through the normal analysis. On the other hand, if the input was already parsed, then the token stream would go straight into the index (the assumption being that someone upstream understands what they're doing).

          This way, in the SolrJ client you could build up some extra functionality so that the PreAnalyzedTextFields would be parsed client side and sent to Solr. In my current application, we have one Solr and N-indexers on different machines. The setup described here would take a big load off of Solr. The other benefit of this setup is that query analysis proceeds as it always does. I don't understand how someone would search over a PreAnalyzed field as it currently stands, without a bit of extra work/custom code on the client.

          One pitfall to my idea is that you'd have to create a similar PreAnalyzedIntField, PreAnalyzedLocationField, PreAnalyzedDateField etc. I wish Java had mixins or multiple inheritance.
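
          For illustration only, a hypothetical probe for the "smart detect" idea above (not in any patch), assuming the JSON serialization used by Solr's JsonPreAnalyzedParser, whose payload carries a version key "v":

              // Treat a value as pre-analyzed only if it looks like the JSON
              // serialization; everything else goes through normal analysis.
              static boolean looksPreAnalyzed(String rawValue) {
                String s = rawValue.trim();
                return s.startsWith("{") && s.contains("\"v\"");
              }

          A real implementation would have to delegate this decision to the configured parserImpl, which is exactly the difficulty raised in the next comment.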

          Andrzej Bialecki added a comment -

          having PreAnalyzedField extend directly from TextField rather than from FieldType

          Done in the patch above. This means you can specify query analyzers.

          index analysis would smart detect whether this input was already parsed or not

          Well, this should be possible with the pluggable parsers that this field supports. Since the formats can be vastly different, depending on the parserImpl, I'm not sure if we could come up with a generic detection mechanism...

          Steve Rowe added a comment -

          FYI, Andrzej, I see you have set 4.2.1 as the fix version - Mark Miller said on #lucene earlier today about cutting a 4.2.1 release: "i may go tonight or tomorrow depending".

          John Berryman added a comment -

          Wow... thanks for being so proactive here. I'll need to look at this in more detail later.

          Andrzej Bialecki added a comment -

          Removing 4.2.1 from Fix version - this apparently needs more discussion.

          David Smiley added a comment -

          Instead of having a special pre-analyzed field type, I think it makes more sense to introduce smarts into Solr's DocumentBuilder: when it sees a pre-analyzed field value (e.g. a "Field" or a TokenStream), instead of calling createField() on the field type, it simply populates the document. There are details to be worked out for sure, but I think this is superior to having a special field type that is pre-analyzed when other field types are not. Arguably all fields should be able to be pre-analyzed.

          Andrzej Bialecki added a comment -

          David: hmm, I'm not sure about that ... this would complicate the processing of all field types in order to support a use case that is very specific. If you throw in the ability to support different serializations then the detection of whether a field content is in a pre-analyzed format or not is not that simple anymore. Do you have a use case in mind that would require switching on the fly between the regular and pre-analyzed formats?

          John: please test the patch that I attached. If this is what you had in mind then I'd like to commit it soon, as it's a clear improvement over the current functionality.

          David Smiley added a comment -

          I guess we see this very differently. I'm not arguing for any "switching on the fly" of anything, even if incidentally what I describe makes that possible. Your design couples the ability to pre-analyze with the choice of field type, and I think that coupling need not exist. It's already creating problems like this very issue (SOLR-4619). Query analysis wouldn't need any improving; it shouldn't have anything to do with whether the field data was analyzed for indexing before it got to Solr or not.

          this would complicate the processing of all field types

          Sure, but it's not much. I've already been poring over DocumentBuilder (as part of SOLR-4329) and I think I know what's involved. It needs to examine the field value: if it's "Field" then it gets added right to the document. A Solr UpdateRequestProcessor could convert the external serialization of the token stream from JSON/Avro/whatever to Field.
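
          For illustration, a hypothetical URP along these lines (not an existing class; names are invented):

              import java.io.IOException;
              import org.apache.solr.common.SolrInputDocument;
              import org.apache.solr.update.AddUpdateCommand;
              import org.apache.solr.update.processor.UpdateRequestProcessor;

              // Sketch: turn a serialized token stream into a ready-made Lucene
              // Field before the document reaches DocumentBuilder.
              class PreAnalyzedConversionProcessor extends UpdateRequestProcessor {
                PreAnalyzedConversionProcessor(UpdateRequestProcessor next) {
                  super(next);
                }

                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                  SolrInputDocument doc = cmd.getSolrInputDocument();
                  // For each field known to be pre-analyzed: deserialize the
                  // JSON/Avro/whatever payload into a Field and replace the raw
                  // value via doc.setField(name, field). How the processor knows
                  // *which* fields is the open question discussed below.
                  super.processAdd(cmd);
                }
              }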

          Jan Høydahl added a comment -

          I like David Smiley's generalized approach. Could you perhaps throw up a patch for discussion?

          David Smiley added a comment -

          Code matters more than words, for sure. I already have more work than I can handle at the moment and I'm sorry I can't contribute the implementation I describe right now.

          Andrzej Bialecki added a comment -

          David, thanks for explaining - I see your point, and I like it too.

          It needs to examine the field value: if it's "Field" then it gets added right to the document.

          Even if we push this functionality to an UpdateRequestProcessor it still needs to know when (not) to convert the input String to a Field. Do you have some thoughts about this?

          In UpdateRequestProcessor we deal with SolrInputDocument-s in the context of the available schema and solrconfig, so there are a few options for how to determine the need for conversion:

          1. It could be based on the document itself (tricky, in the light of multiple serialization formats).
          2. OR, we could extend SolrInputField (and the document serializations) to support additional per-field flags to indicate this, but that would complicate matters even more.
          3. OR, if this decision is going to be based on schema we would have to extend schema to pass additional flags to mark fields as preanalyzed - also tricky.
          4. And finally, we could put the list of fields to always convert in the init args of this UpdateRequestProcessor in solrconfig ... but that's a bit ugly, mixing schema and solrconfig.
          David Smiley added a comment -

          1. It could be based on the document itself (tricky, in the light of multiple serialization formats).

          I don't understand.

          2. OR, we could extend SolrInputField (and the document serializations) to support additional per-field flags to indicate this, but that would complicate matters even more.

          If it were generalized, i.e. a per-field map of metadata, then I rather like it. Though it'd take a fair amount of work, I think, to fully realize this (e.g. updating SolrJ & XML & JSON input formats), and it might also make URPs into more of a full-fledged pipeline, but I think such things are better done external to Solr.

          3. OR, if this decision is going to be based on schema we would have to extend schema to pass additional flags to mark fields as preanalyzed - also tricky.

          I don't like this, as it ties the choice of pre-analysis to the schema which I think is unnecessary coupling just as it is to the field type.

          4. And finally, we could put the list of fields to always convert in the init args of this UpdateRequestProcessor in solrconfig ... but that's a bit ugly, mixing schema and solrconfig.

          I don't see it as "mixing schema" unless you simply mean to say that fields are referred to outside of the schema. But heck, that's inevitable, as fields are already referred to all over solrconfig.xml. It's unrealistic to expect the names of fields in one's schema to not exist outside of schema.xml; the app needs to know them too.

          One option would be to pass in a pseudo field _pre-analyzed_ (leading and trailing underscore) with a list of field names that are pre-analyzed. The sender of the data is certainly aware of which fields are pre-analyzed as it had to pre-analyze them, so it can simply communicate that.

          Andrzej Bialecki added a comment -

          So it looks to me like the least controversial option is to put the list of preanalyzed fields in solrconfig in the specification of the URP. The trick with a "magic" field name sounds useful too - it would allow overriding the list of fields on a per-document basis. This could be also achieved by passing the list of fields via SolrParams - although it would affect all documents in a given update request.

          Anyway, I think these are good ideas worth trying. I'll start working on a patch. Thanks for the comments!
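
          For illustration, a hypothetical factory for the "field list in solrconfig" option (names invented; not the eventual SOLR-4648 implementation):

              import java.util.Arrays;
              import java.util.Collections;
              import java.util.HashSet;
              import java.util.Set;
              import org.apache.solr.common.util.NamedList;
              import org.apache.solr.request.SolrQueryRequest;
              import org.apache.solr.response.SolrQueryResponse;
              import org.apache.solr.update.processor.UpdateRequestProcessor;
              import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

              class PreAnalyzedConversionProcessorFactory extends UpdateRequestProcessorFactory {
                private Set<String> fields = Collections.emptySet();

                @Override
                public void init(NamedList args) {
                  // e.g. <str name="fields">title,body</str> in the processor's init args
                  Object names = args.remove("fields");
                  if (names != null) {
                    fields = new HashSet<>(Arrays.asList(names.toString().split(",")));
                  }
                }

                @Override
                public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                    SolrQueryResponse rsp, UpdateRequestProcessor next) {
                  // A "magic" request field could override `fields` per document here.
                  return next; // conversion processor elided; see the earlier sketch
                }
              }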

          Andrzej Bialecki added a comment -

          I came to the conclusion that this is really a different issue, worth pursuing on its own, so I created SOLR-4648. The improvements in analysis that I mentioned in the original description, and the patch, should be applied regardless of SOLR-4648.

          Steve Rowe added a comment - edited

          Patch that brings Andrzej's patch up to date with trunk, and adds tests for query-time functionality.

          I had assumed that PreAnalyzedField-s would use the PreAnalyzedTokenizer at query time, but that is not (currently) the case: instead FieldType.DefaultAnalyzer is used. This patch changes the behavior when no analyzer is specified to instead use PreAnalyzedTokenizer.

          However, there is a chicken-and-egg interaction between PreAnalyzedTokenizer and QueryBuilder.createFieldQuery(), which aborts before performing any tokenization if the supplied analyzer's attribute factory doesn't contain a TermToBytesRefAttribute. But PreAnalyzedTokenizer doesn't have any attributes defined until the input stream is consumed, in reset(). As part of LUCENE-5388, Robert Muir added a code comment to PreAnalyzedTokenizer's ctor, where AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY is set as the attribute factory rather than the default packed implementation: "we don't pack attributes: since we are used for (de)serialization and dont want bloat."

          This patch moves the stream.reset() call in QueryBuilder.createFieldQuery() in front of the TermToBytesRefAttribute check, so that PreAnalyzedTokenizer (and other tokenizers that don't have a pre-added set of attributes) has a chance to populate its attributes. It also moves the addAttribute(PositionIncrementAttribute.class) call to after the TermToBytesRefAttribute check, since that attribute won't be needed if no tokenization will be performed.
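
          A rough sketch of the reordered setup, with shapes assumed from the description above (a fragment of createFieldQuery(); the query construction that follows is elided):

              try (TokenStream source = analyzer.tokenStream(field, queryText);
                   CachingTokenFilter stream = new CachingTokenFilter(source)) {
                // reset() first, so a tokenizer like PreAnalyzedTokenizer that only
                // defines its attributes while consuming input can populate them.
                stream.reset();
                // Only then check whether there is anything to build terms from.
                if (!stream.hasAttribute(TermToBytesRefAttribute.class)) {
                  return null;
                }
                TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
                // PositionIncrementAttribute is added last: it isn't needed when
                // the check above bails out without tokenizing.
                PositionIncrementAttribute posIncAtt = stream.addAttribute(PositionIncrementAttribute.class);
                // ... term/phrase/boolean query construction as before ...
              }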

          An alternate approach to fix the chicken-and-egg problem might be to have PreAnalyzedTokenizer always include a dummy TermToBytesRefAttribute implementation, and then remove it when reset() is called, but that seems hackish.

          I haven't run the full tests yet with this patch, but the included query-time PreAnalyzedField tests succeed.

          I welcome feedback.

          Robert Muir added a comment -

          But PreAnalyzedTokenizer doesn't have any attributes defined until the input stream is consumed, in reset()

          Right, that's a bug really. According to TokenStream's class javadocs:

          The workflow of the new TokenStream API is as follows:

          1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
          2. The consumer calls reset().
          3. The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.

          So we have consumers (such as QueryBuilder) doing things out of order: they do step 3 before step 2.

          My question is, can we detect this in tests? If MockAnalyzer can enforce it, it is easier to fix it consistently everywhere. One idea: what if MockTokenizer deferred initializing its attributes until reset()? It's not going to be the best solution (we'd need to tie it into its state machine logic somehow), but it might be an easy step.
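
          For reference, a sketch of a consumer written in the order the javadocs prescribe, with reset() before any attribute lookups; with this order, a tokenizer that only defines its attributes during reset(), like PreAnalyzedTokenizer, still works:

              import org.apache.lucene.analysis.Analyzer;
              import org.apache.lucene.analysis.TokenStream;
              import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

              static void consume(Analyzer analyzer, String field, String text) throws java.io.IOException {
                try (TokenStream ts = analyzer.tokenStream(field, text)) { // step 1
                  ts.reset();                                              // step 2
                  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class); // step 3
                  while (ts.incrementToken()) {
                    System.out.println(term);
                  }
                  ts.end();
                }
              }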

          Robert Muir added a comment -

          Also, the majority of TokenFilters (which basically serve as consumers too) are doing step 3 before step 2 today. Most of them are just assigning to final variables in their constructor.

          So something is off; we have to go one of two ways: either fix the documentation to put step 3 before step 2, and fix this TokenStream to somehow provide attributes before reset(), or make a massive change to tons of tokenizers (making them more complex and less efficient).

          But I think we have to do something, at least we should fix the docs to be clear, they need to reflect reality.

          Steve Rowe added a comment -

          Massive change doesn't seem warranted.

          But PreAnalyzedTokenizer doesn't have any attributes defined until the input stream is consumed, in reset()

          Right, that's a bug really.

          fix this TokenStream to somehow provide attributes before reset()

          Since the input reader must be consumed before the attributes can be provided, the tokenizer must somehow have access to the input reader prior to reset(). The most likely place is setReader(), but Tokenizer.setReader() is final.

          A new analyzer class employing PreAnalyzedTokenizer could override initReader() or setReader(). I'll try with setReader(), since the docs for initReader() are focused on reader conditioning via char filters.

          Steve Rowe added a comment -

          A new analyzer class employing PreAnalyzedTokenizer could override initReader() or setReader(). I'll try with setReader(), since the docs for initReader() are focused on reader conditioning via char filters.

          I was referring to TokenStreamComponents.setReader() here, which is called as part of Analyzer.tokenStream(): A subclass created in the new analyzer's overridden createComponents() could call a new method on PreAnalyzedTokenizer to consume the input reader and in so doing provide the attributes.

          Steve Rowe added a comment -

          A new analyzer class employing PreAnalyzedTokenizer could override initReader() or setReader(). I'll try with setReader(), since the docs for initReader() are focused on reader conditioning via char filters.

          I was referring to TokenStreamComponents.setReader() here, which is called as part of Analyzer.tokenStream(): A subclass created in the new analyzer's overridden createComponents() could call a new method on PreAnalyzedTokenizer to consume the input reader and in so doing provide the attributes.

          Patch implementing the idea, splitting reader consumption out from reset() into its own method: decodeInput(). This method first removes all attributes from PreAnalyzedTokenizer's AttributeSource, then adds needed ones as a side effect of parsing the input.

          There is a kludge here: because TokenStreamComponents.setReader() doesn't throw an exception, PreAnalyzedAnalyzer overrides createComponents() to create a TokenStreamComponents instance that catches exceptions encountered during reader consumption and stores them with the stream's PreAnalyzedTokenizer instance, whose reset() method then throws the stored exception, if any.

          With this patch, PreAnalyzedAnalyzer can be reused; previously PreAnalyzedTokenizer reuse would ignore new input and re-emit tokens deserialized from the initial input.
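
          A sketch of that mechanism inside PreAnalyzedAnalyzer (decodeInput() is from the patch; `parser` and the exception-stash setter name are assumed here):

              @Override
              protected TokenStreamComponents createComponents(String fieldName) {
                final PreAnalyzedTokenizer tokenizer = new PreAnalyzedTokenizer(parser);
                return new TokenStreamComponents(tokenizer) {
                  @Override
                  protected void setReader(Reader reader) {
                    super.setReader(reader);
                    try {
                      tokenizer.decodeInput(reader); // consume input, (re)build attributes
                    } catch (IOException e) {
                      // setReader() can't throw, so stash the exception; the
                      // tokenizer's reset() rethrows it to the consumer.
                      tokenizer.setReaderConsumptionException(e);
                    }
                  }
                };
              }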

          With this patch, PreAnalyzedField analysis works like this:

          1. If a query analyzer is specified in the schema then it will be used at query time.
          2. If an analyzer is specified in the schema with no type (i.e., it is neither of "index" nor "query" type), then this analyzer will be used for query parsing, but will be ignored at index time.
          3. If only an analyzer of "index" type is specified in the schema, then this analyzer will be used for query parsing, but will be ignored at index time.

          This patch adds a new method removeAllAttributes() to AttributeSource, to support reuse of token streams with variable attributes, like PreAnalyzedTokenizer.
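
          On the tokenizer side, decodeInput() might use it roughly like this (a sketch; parsing details elided):

              private void decodeInput(Reader reader) throws IOException {
                removeAllAttributes(); // forget attributes deserialized from prior input
                // Parse the serialized token stream, re-adding only the attributes
                // the new input actually declares, e.g.:
                //   CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
              }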

          I think it's ready to go.

          Steve Rowe added a comment -

          Updated patch: fixes PreAnalyzedFieldTest.testInvalidJson() to properly initialize its PreAnalyzedField and to call reset() on the token streams created from the invalid JSON snippets, so that the exceptions triggered by the invalid JSON and stored during token stream creation are appropriately thrown. Also tests that PreAnalyzedAnalyzer can be reused with valid input after having been fed invalid input.

          Steve Rowe added a comment -

          I plan on committing this later today if there are no objections.

          ASF subversion and git services added a comment -

          Commit 1725869 from Steve Rowe in branch 'dev/trunk'
          [ https://svn.apache.org/r1725869 ]

          SOLR-4619: Improve PreAnalyzedField query analysis

          ASF subversion and git services added a comment -

          Commit 1725871 from Steve Rowe in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1725871 ]

          SOLR-4619: Improve PreAnalyzedField query analysis (merged trunk r1725869)

          Steve Rowe added a comment -

          Committed to trunk and branch_5x.

          I opened LUCENE-6987 to address the TokenStream workflow documentation.

          ASF subversion and git services added a comment -

          Commit 1726069 from Steve Rowe in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1726069 ]

          SOLR-4619: Fix Java7-only javadoc warnings by removing {@link}s to private static nested class PreAnalyzedTokenizer


            People

            • Assignee: Steve Rowe
            • Reporter: Andrzej Bialecki
            • Votes: 1
            • Watchers: 9

