SOLR-2802

Toolkit of UpdateProcessors for modifying document values

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      Frequently users ask questions where the answer is "you could do it with an UpdateProcessor", but the number of out-of-the-box UpdateProcessors is generally lacking, and there aren't even very good base classes for the common case of manipulating field values when adding documents.


          Activity

          Hoss Man added a comment -

          Patch containing a rough start for the type of thing i had in mind.

          this patch implements 2 usable UpdateProcessors...

          • TrimFieldUpdateProcessorFactory
          • ConcatFieldUpdateProcessorFactory

          ...using these abstract subclasses...

          • FieldMutatingUpdateProcessorFactory - handles configuration for what fields the processor should act on (by name, type, name regex, or type class)
          • FieldMutatingUpdateProcessor - handles the rote work of dealing with AddUpdateCommands and checking which fields the configuration indicates should be modified, so subclasses can focus solely on the relevant SolrInputFields
          • FieldValueMutatingUpdateProcessor - handles the rote work of dealing with SolrInputFields when subclasses just want to modify all individual values of a field in place (see the sketch below)
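
          A minimal sketch of how a concrete factory like TrimFieldUpdateProcessorFactory could sit on top of these base classes; the getSelector()/mutateValue() hooks and constructor shape shown here are assumptions based on the descriptions above, not necessarily the exact API in the attached patch:

          import org.apache.solr.request.SolrQueryRequest;
          import org.apache.solr.response.SolrQueryResponse;
          import org.apache.solr.update.processor.UpdateRequestProcessor;

          // assumes this class lives alongside the base classes in the update.processor package
          public class TrimFieldUpdateProcessorFactory extends FieldMutatingUpdateProcessorFactory {
            @Override
            public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                                      SolrQueryResponse rsp,
                                                      UpdateRequestProcessor next) {
              // FieldValueMutatingUpdateProcessor walks the fields chosen by the configured
              // selector and invokes mutateValue() once per individual value
              return new FieldValueMutatingUpdateProcessor(getSelector(), next) {
                @Override
                protected Object mutateValue(Object src) {
                  // implicitly ignore anything that isn't character data, regardless of config
                  return (src instanceof CharSequence)
                      ? ((CharSequence) src).toString().trim()
                      : src;
                }
              };
            }
          }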

          Additional subclasses that seem like they would be useful, easy to implement, and fit easily into this framework would be...

          • RemoveBlankFieldUpdateProcessorFactory - ie: toss ""
          • HTMLStripFieldUpdateProcessorFactory
          • FirstFieldValueUpdateProcessorFactory
          • LastFieldValueUpdateProcessorFactory
          • ParseNumericFieldUpdateProcessorFactory - preconfigured formats
          • ParseDateFieldUpdateProcessorFactory - preconfigured formats, tz from field
          • ParseBooleanFieldUpdateProcessorFactory - configured lists of values to map to true/false

          Would be helpful to get feedback in particular on the field config strategy. My thinking is that in general they should default to mutating all fields, but ignore things that don't match expectations (ie: Trim doesn't mess with things that aren't Strings); some subclasses could choose different defaults based on the implementing class. Also seems like it would be helpful to support "excluding" fields (by name, regex, type, etc...)

          Martijn van Groningen added a comment -

          +1 This is really needed!
          I think that in general processors should match nothing by default. Matching everything could lead to unexpected behaviour for users in the long run. If a user wants a processor applied to all compatible fields, he/she can configure:

          <str name="fieldRegex">.*</str>

          Or alternatively we can introduce a match all field like this:

          <str name="fieldName">*</str>

          To make it somewhat easier for end users.

          Jan Høydahl added a comment -

          Yep, we need more processors, and also a better framework. I already have a FieldCopy processor which can copy/move fields; I will attach it to SOLR-2599 shortly.

          Erik Hatcher added a comment -

          With SOLR-2599, I imagine we could take copyField's out of schema.xml, eh? Interesting consequences of that.

          Before we get too carried away, what about making this even more general purpose with scripting, ala SOLR-1725 ?

          There's one other update processor that perhaps could fit within this framework and become something generally useful in Solr - SOLR-1280

          Jan Høydahl added a comment -

          I think <copyField> should still remain in the schema, as it is often very tightly linked with things like copying fields for spellcheck or creating a facet field for an analyzed field. Thus, it is much clearer when that config is kept together in the schema, instead of as a processor config, where the schema design can fail if the user selects a different update.chain than the one you expected, etc. The FieldCopy processor is made to copy, concatenate or rename fields because you may need a fieldCopy in the context of one update.chain, but maybe not in the context of another one.

          My vision (which I'll also address in Barcelona) is to improve the updateChain architecture in several ways; one is that script processors (SOLR-1725) become first-class processors.

          Hoss Man added a comment -

          I already have a FieldCopy processor which can copy/move fields,

          Jan: Yeah ... I designed the base class around the assumption that we would come up with a good "clone fields" processor in SOLR-2599, so that processors can simply modify the values "in place" and people can clone/rename fields as needed before using them

          With SOLR-2599, I imagine we could take copyField's out of schema.xml,

          Erik: I actually consider them very orthogonal. Supporting cloning/copying in an update processor is a way of saying "when docs are added to the index using this Update Chain, take these actions on the fields" but copyField in schema.xml is a way of saying "no matter where this doc comes from, the value of field X should also be put in field Y"

          Before we get too carried away, what about making this even more general purpose with scripting, ala SOLR-1725 ?

          We definitely should get the Script Processor in for people who don't know java but have specific goals, but we shouldn't let support for scripting prevent us from implementing some of the more commonly requested actions in java - there's a fine line between "you can write scripts to do anything you want" and "you have to write scripts to do everything you want"

          There's one other update processor that perhaps could fit within this framework and become something generally useful in Solr - SOLR-1280

          I looked at that one before i started, actually; because of the "modify in place" nature of this base class, it didn't really seem like a good fit to refactor that one to be a subclass.

          I think that in general processors should match nothing by default. Matching everything could lead to unexpected behaviour for users in the long run.

          Martijn: I kept going back and forth on this while i was working on it. Ultimately my thought process was that it didn't really make sense for the "default" to be a No-Op because if that's the case then what's the point of having a default at all?

          And if we're going to require that they provide at least one of the field selectors, and we want to offer them syntactic sugar for "match all fields", why not make it the shortest sugar possible?

          I figured it would make sense for the base class to assume that "no args" meant let the subclass see all of the fields/values – and the subclasses could enforce their own default rules as needed, a la...

          • implicitly...
            • in the TrimFieldUpdateProcessorFactory attached, it ignores anything that isn't an instance of String – regardless of how it's configured (so it doesn't call toString() on an Integer and then try to trim that)
          • explicitly
            • i imagine that Date/Number parsing update processors should default to only trying to parse fields where the FieldType extends DateField/TrieField (the Concat processor should probably do the same for StrField fields configured to be multiValued=false, now that i think about it). But unlike how the Trim processor works, if the user explicitly configures it to parse fields named "foo.*", it should try to do so regardless of what the field type/settings might be, because maybe a subsequent processor will rename/move those fields in the input docs to something that is expecting a Date/Number (or does support multivalued fields)

          what do you think?

          the scenario that still bothers me about all this is that if we put something like this in the example schema...

          <updateRequestProcessorChain name="simple" default="true">
           <processor class="solr.TrimFieldUpdateProcessorFactory" />
           <processor class="solr.LogUpdateProcessorFactory" />
           <processor class="solr.RunUpdateProcessorFactory" />
          </updateRequestProcessorChain>
          

          ...(so all strings get trimmed) someone might say "Hey, stop trimming my strings!" and it's easy for them to remove that from the example. But someone else might say: "This is exactly what i want most of the time, but I've got this one field where whitespace matters, stop trimming that one." – and now he's got to jump through a lot of hoops to keep the trim behavior on all but one field (unless we add some sort of exclusion option(s)). Even if we make some field selection args mandatory for the processor and use this instead...

          <updateRequestProcessorChain name="simple" default="true">
           <processor class="solr.TrimFieldUpdateProcessorFactory">
             <str name="fieldRegex">.*</str>
           </processor>
           <processor class="solr.LogUpdateProcessorFactory" />
           <processor class="solr.RunUpdateProcessorFactory" />
          </updateRequestProcessorChain>
          

          ...that user still has the same amount of pain to deal with.

          Martijn van Groningen added a comment -

          And if we're going to require that they provide at least one of the field selectors, and we want to offer them syntactic sugar for "match all fields", why not make it the shortest sugar possible?

          I'm having doubts about this now as well... Maybe we should go with a default that processes all fields. That behaviour should be documented well, though. If a user doesn't want trimming, for example, then the TrimFieldUpdateProcessorFactory shouldn't have been configured in the first place.

          I think I don't completely follow the explicit ruling. I think Date/Number parsing should be done on compatible fields only. If a processor moves/renames fields, then it should be configured before the processor that does the Date/Number parsing. If processors also try to parse incompatible fields, this might end up with an exception being thrown (depending on whether exceptions get swallowed or not).

          behavior on all but one field (unless we add some sort of exclusion option(s)).

          +1 We definitely need this!

          Hoss Man added a comment -

          I had some time to revisit this issue again today.

          Improvements in this patch:

          • exclude options - you can now specify one or more sets of "exclude" lists, which are parsed just like the main list of field specifiers (examples below)
          • improved defaults for ConcatFieldUpdateProcessorFactory - default behavior is now to only concat values for fields that the schema says are multiValued=false and (StrField or TextField)
          • new RemoveBlankFieldUpdateProcessorFactory - removes any 0-length CharSequence values it finds; by default it looks at all fields
          • new FieldLengthUpdateProcessorFactory - replaces any CharSequence values it finds with their length; by default it looks at no fields

          As part of this work, i tweaked the abstract classes so that the assumption about what fields a subclass should match "by default" is still "all fields", but it's easy for the subclasses to override this – the user still has the final say, and the abstract class handles that, but if the user doesn't configure anything the subclass can easily say "my default should be ___"

          I think I don't completely follow the explicit ruling

          I explained myself really terribly before - i was conflating what should really be two orthogonal things:

          1) the field names that a processor looks at – the user should have lots of options for configuring the field selector explicitly, and if they don't, then a sensible default based on the specifics of the processor should be applied, and the user should still have the ability to configure exclusion rules on top of that default

          2) the value types that a processor will deal with – regardless of what field names a processor is configured with, it should be logical about the types of values it finds in those fields. The FieldLengthUpdateProcessorFactory i just added, for example, only pays attention to values that are CharSequence; if the SolrInputField already contained an Integer, it wouldn't make sense to toString() that and then find the length of that String value. (see the sketch below)
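
          A hedged sketch combining those two points for the new FieldLengthUpdateProcessorFactory; the getDefaultSelector() hook and SELECT_NO_FIELDS constant are assumed names for the "subclass picks its own default" mechanism described above, not confirmed from the patch:

          import org.apache.solr.core.SolrCore;
          import org.apache.solr.request.SolrQueryRequest;
          import org.apache.solr.response.SolrQueryResponse;
          import org.apache.solr.update.processor.UpdateRequestProcessor;

          // assumes this class lives alongside the base classes in the update.processor package
          public class FieldLengthUpdateProcessorFactory extends FieldMutatingUpdateProcessorFactory {

            // point 1: if the user configures no field selector at all, this factory's
            // default is "no fields"; any explicit fieldName/fieldRegex/typeName/typeClass
            // (and exclude) configuration still takes precedence
            @Override
            protected FieldMutatingUpdateProcessor.FieldNameSelector getDefaultSelector(SolrCore core) {
              return FieldMutatingUpdateProcessor.SELECT_NO_FIELDS;
            }

            @Override
            public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                                      SolrQueryResponse rsp,
                                                      UpdateRequestProcessor next) {
              return new FieldValueMutatingUpdateProcessor(getSelector(), next) {
                @Override
                protected Object mutateValue(Object src) {
                  // point 2: only character data has a meaningful "length"; other value
                  // types pass through untouched rather than being toString()'d first
                  return (src instanceof CharSequence)
                      ? ((CharSequence) src).length()
                      : src;
                }
              };
            }
          }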

          I think Date/Number parsing should be done on compatible fields only. If a processor moves/renames fields, then it should be configured before the processor that does the Date/Number parsing.

          But that could easily lead to a chicken-vs-egg problem. I think ideally you should be able to have field names in your SolrInputDocuments (and in your processor configurations) that don't exist in your schema at all, so you can have "transitory" names that exist purely for passing info around.

          Imagine a situation where you want to let clients submit documents containing a "publishDate" field, but you want to be able to cleanly accept real Date objects (from java clients) or Strings in a variety of formats, and then you want the final index to contain two versions of that date: one indexed TrieDateField called "pubDate", and one non indexed StrField called "prettyDate" – ie, there is no "publishDate" in your schema at all. You could then configure some "ParseDateFieldUpdateProcessor" on the "publishDate" even though that field name isn't in your schema, so that you have consistent Date objects, and then use a CloneFieldUpdateProcessor and/or RenameFieldUpdateProcessor to get that Date object into both your "pubDate" and "prettyDate" fields, and then use some sort of FormatDateFieldUpdateProcessor on the "prettyDate" field.

          There may be other solutions to that type of problem, but I guess the bottom line from my perspective is: why bother making a processor deliberately fail if the user configures it to do something unexpected but still viable? If they want to parse Strings -> Dates on a TrieIntField, why not just let them do it? maybe they've got another processor later that is going to convert that Date to "days since epoch" as an integer?

          Examples of the exclude configuration...

          <updateRequestProcessorChain name="trim-few">
            <processor class="solr.TrimFieldUpdateProcessorFactory">
              <str name="fieldRegex">foo.*</str>
              <str name="fieldRegex">bar.*</str>
              <!-- each set of exclusions is checked independently -->
              <lst name="exclude">
                <str name="typeClass">solr.DateField</str>
              </lst>
              <lst name="exclude">
                <str name="fieldRegex">.*HOSS.*</str>
              </lst>
            </processor>
          </updateRequestProcessorChain>
          <updateRequestProcessorChain name="trim-some">
            <processor class="solr.TrimFieldUpdateProcessorFactory">
              <str name="fieldRegex">foo.*</str>
              <str name="fieldRegex">bar.*</str>
              <!-- only excluded if it matches all in set -->
              <lst name="exclude">
                <str name="typeClass">solr.DateField</str>
                <str name="fieldRegex">.*HOSS.*</str>
              </lst>
            </processor>
          </updateRequestProcessorChain>
          

          In the "trim-few" case, field names will be excluded if they are DateFields or match the "HOSS" regex. In the "trim-some" case, field names will be excluded only if they are both a DateField and match the "HOSS" regex.

          Lance Norskog added a comment -

          Could this patch support scripting? Like in SOLR-1725? That is, does it support the range of problems that a scripting processor needs? Would it make writing scripts much easier?

          Hoss Man added a comment -

          Lance: see my 01/Oct/11 comment responding to Erik's nearly identical question (re: scripting)

          Updated patch adds some more tools to the tool box...

          • HTMLStripFieldUpdateProcessorFactory
          • FirstFieldValueUpdateProcessorFactory
          • LastFieldValueUpdateProcessorFactory
          • MinFieldValueUpdateProcessorFactory
          • MaxFieldValueUpdateProcessorFactory

          The last 4 are subclasses of a new intermediate base class "FieldValueSubsetUpdateProcessorFactory" designed to make it easy to write subclasses that want to look at all of the values for a field as a Collection, and then return some subset of those values. (in these 4 cases, the subset is always a collection containing a single value)
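
          A rough sketch of the intended subclass shape, using the first-value case as an example; the pickSubset() hook name is an assumption based on the description above:

          import java.util.Collection;
          import java.util.Collections;

          // assumes the FieldValueSubsetUpdateProcessorFactory base described above
          public class FirstFieldValueUpdateProcessorFactory extends FieldValueSubsetUpdateProcessorFactory {
            // given all values of a selected field, keep only the first one; the base
            // class takes care of replacing the field's values with whatever we return
            @Override
            public Collection<Object> pickSubset(Collection<Object> values) {
              return values.isEmpty()
                  ? values
                  : Collections.singletonList(values.iterator().next());
            }
          }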

          Working with this patch today was the first time i didn't feel like i needed to tweak the existing base classes, which has me starting to think that the API is probably gelled enough for mass consumption. I'd really like to get some more voices chiming in on what they think of the field selector configuration schema so we can move forward on getting what's here committed on trunk, and then iteratively work on adding more concrete subclasses until we're really sure that this API makes sense for people to use when writing subclasses, and then look at backporting to 3x.

          anyone have any opinions on the syntax?

          Hoss Man added a comment -

          Updated patch adds a new RegexpReplaceProcessorFactory.

          this new factory is a refactoring of the one Jan attached to SOLR-2825, changed to utilize the base classes created in this issue.

          it has all of the same core functionality as the original, but because of the base class, it can support all of the field selector options i've mentioned before, instead of a single comma-separated list of field names. I also added some additional error checking that doesn't exist in the SOLR-2825 version, with more explicit error messages when params are missing or patterns are invalid. In spite of these additions, this version is only half as many lines of code (due to the base class reuse).
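
          A standalone sketch of the kind of fail-fast init-param checking described above; the param name and error wording here are illustrative, not copied from the patch:

          import java.util.regex.Pattern;
          import java.util.regex.PatternSyntaxException;
          import org.apache.solr.common.SolrException;
          import org.apache.solr.common.util.NamedList;

          class RegexInitValidationSketch {
            // pull a required regex param out of the factory's init args, failing with an
            // explicit message instead of a bare NullPointerException / PatternSyntaxException
            static Pattern requiredPattern(NamedList args, String name) {
              Object raw = args.remove(name);
              if (raw == null) {
                throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                    "Missing required init parameter: '" + name + "'");
              }
              try {
                return Pattern.compile(raw.toString());
              } catch (PatternSyntaxException e) {
                throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                    "Invalid regex for init parameter '" + name + "': " + raw, e);
              }
            }
          }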

          Jan Høydahl added a comment -

          Sweet! You got there before me.

          Hoss Man added a comment -

          Polished up the previous patch...

          • all TODOs & nocommits gone
          • javadocs with example config for every factory
          • locked down everything as final and/or protected unless there was a reason to be otherwise
          • added more tests
          • renamed Regex*p*ReplaceProcessorFactory to RegexReplaceProcessorFactory (didn't notice the naming before; pretty sure "Regex" is more common these days)

          I'm ready to commit this and open new (smaller) issues to iterate on the other subclass ideas mentioned in this issue.

          speak now or forever hold your pieces!

          Hoss Man added a comment -

          Committed revision 1242514.

          Given the minimal feedback on these APIs so far, i'm only committing to trunk for now so they can soak a bit.

          but if there is interest, and no one thinks the APIs need to change, backporting to 3x should be easy.

          Robert Muir added a comment -

          I think the min/max is strange: it catches ClassCastException (as hoss explained on dev list):

           * If this is not the case, then the full list of all values found will be 
           * used as is.
          

          In my opinion, if a user asks for min or max or some other computation, and this is not possible, it should return an error; otherwise, why did they configure this in their chain?

          I don't think it's friendly to just sneakily act as a no-op instead...

          Moreover, I think the root cause of the eclipse-compiler problems is that it's not taking Comparable. I think min/max should not extend this type-unsafe Subset base, as they should not return a subset anyway but a singleton, and the input must be comparable...

          Hoss Man added a comment -

          In my opinion, if a user asks for min or max or some other computation, and this is not possible, it should return an error; otherwise, why did they configure this in their chain?

          Agreed, i'm not sure what use case i had in mind when i wrote min/max to "pass through" in this situation, but failing hard is definitely better – at least by default. If someone comes up with a reason not to fail, we can always add an option later.

          I've committed this change in r1242625.

          I think min/max should not extend this type-unsafe Subset base, as they should not return a subset anyway, but a singleton, and the input must be comparable...

          If you'd like to take a stab at refactoring, by all means be my guest. It's true, these instances don't need to return a subset, but even if we change them to not subclass that particular base class, I don't see any simple way to rewrite them such that they only accept a Collection<Comparable>. UpdateProcessors deal with SolrInputDocuments & SolrInputFields that are just bags of objects; the schema hasn't been consulted yet, so we don't have any hard type information about the types of these Objects (and even if we could, we wouldn't want to consult the schema yet, because some of these "fields" might be for input purposes only – some UpdateProcessor down the pipe might be copying/moving them to different fields).

          So if you want these Min/Max processors to have APIs that strictly enforce Collection<Comparable<T>>, then some code somewhere needs to check that and cast appropriately – at the moment, they delegate that responsibility to Collections.min and Collections.max, because that class does that check anyway as it does its computation.

          Personally i think the current impl is better anyway, because in the common case of clients sending "clean data" we don't waste cycles checking the type of every Object sent before asking the Collections class to find the min/max and doing the check again anyway. If an exceptional case happens, we catch/log/wrap the exception.
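
          A hedged sketch of that delegate-and-wrap approach for the max case; the pickSubset() hook and the exact exception wrapping shown are assumptions about the shape of the code, matching the behavior described above rather than quoting the commit:

          import java.util.Collection;
          import java.util.Collections;
          import org.apache.solr.common.SolrException;

          // assumes the FieldValueSubsetUpdateProcessorFactory base discussed earlier
          public class MaxFieldValueUpdateProcessorFactory extends FieldValueSubsetUpdateProcessorFactory {
            @Override
            @SuppressWarnings({"unchecked", "rawtypes"})
            public Collection<Object> pickSubset(Collection<Object> values) {
              try {
                // let Collections.max perform the Comparable check as part of its single
                // pass, instead of pre-checking the type of every value ourselves
                return Collections.singletonList(Collections.max((Collection) values));
              } catch (ClassCastException e) {
                // values weren't mutually comparable: fail hard with a clear error
                throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                    "Field values are not mutually comparable: " + e.getMessage(), e);
              }
            }
          }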

          Robert Muir added a comment -

          but failing hard is definitely better – at least by default.

          Thanks, I think this makes it a lot less scary!

          UpdateProcessors deal with SolrInputDocuments & SolrInputFields that are just bags of objects

          ok, looking around I can see this is a larger type-safety issue that shouldn't be addressed on this issue.

          Thanks for fixing the change (and fixing the eclipse compile).


            People

            • Assignee: Hoss Man
            • Reporter: Hoss Man
            • Votes: 0
            • Watchers: 2
