Solr
  1. Solr
  2. SOLR-4892

Add field update processors to parse/convert String-typed fields to Date, Number, and Boolean

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 5.0
    • Component/s: update
    • Labels:
      None

      Description

      Add FieldMutatingUpdateProcessorFactory subclasses ParseFooUpdateProcessorFactory, where Foo includes Date, Double, Long, and Boolean.

      These factories will have a default selector that matches all fields that either don’t match any schema field, or are in the schema with the corresponding typeClass. They can optionally support a list of multiple format specifiers. If they see a value that is not a CharSequence, or can't parse the value using a configured format, they ignore it and leave it as is.

      For multi-valued fields, these processors will not convert any values unless all are first successfully parsed. Ordering the processors [Boolean, Long, Double, Date] will allow e.g. values [2, 5, 8.6] to be left alone by the Boolean and Long processors, but then converted by the Double processor.

      1. SOLR-4892.patch
        133 kB
        Steve Rowe
      2. SOLR-4892.patch
        122 kB
        Steve Rowe
      3. SOLR-4892.patch
        102 kB
        Steve Rowe

        Issue Links

          Activity

          Hide
          Steve Rowe added a comment -

          Patch, I think it's ready to go.

          I've added Joda-time 2.2 as a solr-core dependency to handle the configurable date parsing.

          I'd appreciate review.

          Show
          Steve Rowe added a comment - Patch, I think it's ready to go. I've added Joda-time 2.2 as a solr-core dependency to handle the configurable date parsing. I'd appreciate review.
          Hide
          Hoss Man added a comment -

          Steve: for the most part the patch looks good, but a couple of things jumped out at me though as seeming kind of fishy...

          1) AllValuesOrNoneFieldMutatingUpdateProcessor semantics

          The "if (destVal == srcVal) ... do nothing" logic seems like a trap for future devs who want to subclass – not to mention i'm not 100% certain i agree it makes sense for the subclasses you've already written. (If I send the list [new String("6.0"), new Double(7.3D), new String("4.3")] to ParseDOubleUpdateProcessor should it really reject the entire list because one of the original values is already a double?)

          I think it would be more straight forward to give subclasses another singleton marker object (similar to the DELETE singleton) they can return from mutateValue() to say "abort processing this list i have encountered an object i do not want". If mutateValue() returns the same object it was passed in, that should be considered a successful mutation.

          2) ParsingFieldUpdateProcessor overriding processAdd

          Some code duplication here that worries me. If we want to make the change I suggested above for parsing strings in lists where we can and allowing type specific objects if they are already the correct type, then (if i'm thinking about this correctly) we don't need this class at all, classes like ParseFooFieldUpdateProcessor could easily just handle the CharSequence check as part of their basic logic...

            if (val instanceof CharSequence) {
               val = ... attempt parse as a Foo...;
            }
            if (val instanceof Foo) return val;
            return SKIP_FIELD_VALUE_LIST_SINGLETON
          

          But even if we want to keep the current semantics of AllValuesOrNoneFieldMutatingUpdateProcessor, then i think it would be cleaner if ParsingFieldUpdateProcessor was changed to be really simple, ala...

          ParsingFieldUpdateProcessor extends AllValuesOrNoneFieldMutatingUpdateProcessor {
            public ParsingFieldUpdateProcessor(FieldNameSelector selector, UpdateRequestProcessor next) {
              super(selector, next);
            }
            protected final Object mutateValue(final Object srcVal) {
              return (srcVal instanceof CharSequence) ? 
                parse(srcVal.toString()) : SKIP_FIELD_VALUE_LIST_SINGLETON;
            }
            protected abstract Object parse(final Object srcVal);
          }
          

          ...and then the concrete subclasses just implement "parse"

          3) either way we go with #2, that means we don't need the access modifieier change on 'selector' in FieldMutatingUpdateProcessor.

          4) double check the handling of field boosts in AllValuesOrNoneFieldMutatingUpdateProcessor ... pretty sure it isn't being preserved (see FieldValueMutatingUpdateProcessor for example)

          5) The SortableFooField FieldType's should also be decorated with your new interfaces.

          6) Can we refactor the locale init param parsing into a helper utility class or a common factory bsae class so it's not duplicated in both ParseNumericFieldUpdateProcessorFactory and ParseDateFieldUpdateProcessorFactory ?


          And some misc questions...

          7) Why no ParseIntFieldUpdateProcessor and ParseFloatFieldUpdateProcessor ?

          8) would it be simpler for users to have a single "locale" init param that accepted things like "ru" and "ru_RU" and "ru_RU_xxx" using LocaleUtils.toLocale in commons-lang?

          9) we should be able to make the ParseFooFieldUpdateProcessor inner classses into static inner classes just by passing hte locale into the constructor and right?

          Show
          Hoss Man added a comment - Steve: for the most part the patch looks good, but a couple of things jumped out at me though as seeming kind of fishy... 1) AllValuesOrNoneFieldMutatingUpdateProcessor semantics The "if (destVal == srcVal) ... do nothing" logic seems like a trap for future devs who want to subclass – not to mention i'm not 100% certain i agree it makes sense for the subclasses you've already written. (If I send the list [new String("6.0"), new Double(7.3D), new String("4.3")] to ParseDOubleUpdateProcessor should it really reject the entire list because one of the original values is already a double?) I think it would be more straight forward to give subclasses another singleton marker object (similar to the DELETE singleton) they can return from mutateValue() to say "abort processing this list i have encountered an object i do not want". If mutateValue() returns the same object it was passed in, that should be considered a successful mutation. 2) ParsingFieldUpdateProcessor overriding processAdd Some code duplication here that worries me. If we want to make the change I suggested above for parsing strings in lists where we can and allowing type specific objects if they are already the correct type, then (if i'm thinking about this correctly) we don't need this class at all, classes like ParseFooFieldUpdateProcessor could easily just handle the CharSequence check as part of their basic logic... if (val instanceof CharSequence) { val = ... attempt parse as a Foo...; } if (val instanceof Foo) return val; return SKIP_FIELD_VALUE_LIST_SINGLETON But even if we want to keep the current semantics of AllValuesOrNoneFieldMutatingUpdateProcessor, then i think it would be cleaner if ParsingFieldUpdateProcessor was changed to be really simple, ala... ParsingFieldUpdateProcessor extends AllValuesOrNoneFieldMutatingUpdateProcessor { public ParsingFieldUpdateProcessor(FieldNameSelector selector, UpdateRequestProcessor next) { super (selector, next); } protected final Object mutateValue( final Object srcVal) { return (srcVal instanceof CharSequence) ? parse(srcVal.toString()) : SKIP_FIELD_VALUE_LIST_SINGLETON; } protected abstract Object parse( final Object srcVal); } ...and then the concrete subclasses just implement "parse" 3) either way we go with #2, that means we don't need the access modifieier change on 'selector' in FieldMutatingUpdateProcessor. 4) double check the handling of field boosts in AllValuesOrNoneFieldMutatingUpdateProcessor ... pretty sure it isn't being preserved (see FieldValueMutatingUpdateProcessor for example) 5) The SortableFooField FieldType's should also be decorated with your new interfaces. 6) Can we refactor the locale init param parsing into a helper utility class or a common factory bsae class so it's not duplicated in both ParseNumericFieldUpdateProcessorFactory and ParseDateFieldUpdateProcessorFactory ? And some misc questions... 7) Why no ParseIntFieldUpdateProcessor and ParseFloatFieldUpdateProcessor ? 8) would it be simpler for users to have a single "locale" init param that accepted things like "ru" and "ru_RU" and "ru_RU_xxx" using LocaleUtils.toLocale in commons-lang? 9) we should be able to make the ParseFooFieldUpdateProcessor inner classses into static inner classes just by passing hte locale into the constructor and right?
          Hide
          Steve Rowe added a comment -

          Thanks Hoss, thanks for the review, I'll work on your suggestions and put up a patch later today.

          About your questions:

          7) Why no ParseIntFieldUpdateProcessor and ParseFloatFieldUpdateProcessor ?

          Because I wasn't going to use them for the schemaless mode example config (which is the driver here), I was keeping the issue lean, but I can add them.

          8) would it be simpler for users to have a single "locale" init param that accepted things like "ru" and "ru_RU" and "ru_RU_xxx" using LocaleUtils.toLocale in commons-lang?

          I didn't know about the commons-lang method, but I separated the locale components for two reasons: 1) Java 7 has a new method Locale.forLanguageTag() that does this, but for 4.X, we can't use it, since it's on Java 6; and 2) there is a precedent for separately configured locale components in BreakIteratorBoundaryScanner: HighlightParams.BS_LANGUAGE and .BS_COUNTRY ("hl.bs.language" and "hl.bs.country", respectively). I'm fine switching to LocaleUtils.toLocale with a single "locale" configuration item, though; I agree it would be simpler for users.

          9) we should be able to make the ParseFooFieldUpdateProcessor inner classses into static inner classes just by passing hte locale into the constructor and right?

          Yes, I'll do that.

          Show
          Steve Rowe added a comment - Thanks Hoss, thanks for the review, I'll work on your suggestions and put up a patch later today. About your questions: 7) Why no ParseIntFieldUpdateProcessor and ParseFloatFieldUpdateProcessor ? Because I wasn't going to use them for the schemaless mode example config (which is the driver here), I was keeping the issue lean, but I can add them. 8) would it be simpler for users to have a single "locale" init param that accepted things like "ru" and "ru_RU" and "ru_RU_xxx" using LocaleUtils.toLocale in commons-lang? I didn't know about the commons-lang method, but I separated the locale components for two reasons: 1) Java 7 has a new method Locale.forLanguageTag() that does this, but for 4.X, we can't use it, since it's on Java 6; and 2) there is a precedent for separately configured locale components in BreakIteratorBoundaryScanner: HighlightParams.BS_LANGUAGE and .BS_COUNTRY ("hl.bs.language" and "hl.bs.country", respectively). I'm fine switching to LocaleUtils.toLocale with a single "locale" configuration item, though; I agree it would be simpler for users. 9) we should be able to make the ParseFooFieldUpdateProcessor inner classses into static inner classes just by passing hte locale into the constructor and right? Yes, I'll do that.
          Hide
          Steve Rowe added a comment - - edited

          Chris Hostetter, this version of the patch includes fixes for all your suggestions.

          Details:

          1) AllValuesOrNoneFieldMutatingUpdateProcessor semantics

          The "if (destVal == srcVal) ... do nothing" logic [...] give subclasses another singleton marker object [...] they can return from mutateValue() to say "abort processing this list i have encountered an object i do not want". If mutateValue() returns the same object it was passed in, that should be considered a successful mutation.

          Done.

          2) ParsingFieldUpdateProcessor overriding processAdd

          [...] we don't need this class at all, classes like ParseFooFieldUpdateProcessor could easily just handle the CharSequence check as part of their basic logic...

          Done - I chose this route and got rid of ParsingFieldUpdateProcessor.

          3) either way we go with #2, that means we don't need the access modifieier change on 'selector' in FieldMutatingUpdateProcessor.

          Fixed.

          double check the handling of field boosts in AllValuesOrNoneFieldMutatingUpdateProcessor ... pretty sure it isn't being preserved (see FieldValueMutatingUpdateProcessor for example)

          Fixed.

          5) The SortableFooField FieldType's should also be decorated with your new interfaces.

          Done.

          6) Can we refactor the locale init param parsing into a helper utility class or a common factory bsae class so it's not duplicated in both ParseNumericFieldUpdateProcessorFactory and ParseDateFieldUpdateProcessorFactory ?

          8) would it be simpler for users to have a single "locale" init param that accepted things like "ru" and "ru_RU" and "ru_RU_xxx" using LocaleUtils.toLocale in commons-lang?

          I switched to using a single "locale" config item, using commons-lang LocaleUtils.toLocale(). This is so small I didn't try to consolidate, I just replicated in the two places.

          7) Why no ParseIntFieldUpdateProcessor and ParseFloatFieldUpdateProcessor ?

          Added.

          9) we should be able to make the ParseFooFieldUpdateProcessor inner classses into static inner classes just by passing hte locale into the constructor and right?

          I did this for the Date and Numeric processors, but the Boolean processor factory has a couple of configured collections and a boolean, so I didn't convert its processor to a static inner class.

          Show
          Steve Rowe added a comment - - edited Chris Hostetter , this version of the patch includes fixes for all your suggestions. Details: 1) AllValuesOrNoneFieldMutatingUpdateProcessor semantics The "if (destVal == srcVal) ... do nothing" logic [...] give subclasses another singleton marker object [...] they can return from mutateValue() to say "abort processing this list i have encountered an object i do not want". If mutateValue() returns the same object it was passed in, that should be considered a successful mutation. Done. 2) ParsingFieldUpdateProcessor overriding processAdd [...] we don't need this class at all, classes like ParseFooFieldUpdateProcessor could easily just handle the CharSequence check as part of their basic logic... Done - I chose this route and got rid of ParsingFieldUpdateProcessor. 3) either way we go with #2, that means we don't need the access modifieier change on 'selector' in FieldMutatingUpdateProcessor. Fixed. double check the handling of field boosts in AllValuesOrNoneFieldMutatingUpdateProcessor ... pretty sure it isn't being preserved (see FieldValueMutatingUpdateProcessor for example) Fixed. 5) The SortableFooField FieldType's should also be decorated with your new interfaces. Done. 6) Can we refactor the locale init param parsing into a helper utility class or a common factory bsae class so it's not duplicated in both ParseNumericFieldUpdateProcessorFactory and ParseDateFieldUpdateProcessorFactory ? 8) would it be simpler for users to have a single "locale" init param that accepted things like "ru" and "ru_RU" and "ru_RU_xxx" using LocaleUtils.toLocale in commons-lang? I switched to using a single "locale" config item, using commons-lang LocaleUtils.toLocale(). This is so small I didn't try to consolidate, I just replicated in the two places. 7) Why no ParseIntFieldUpdateProcessor and ParseFloatFieldUpdateProcessor ? Added. 9) we should be able to make the ParseFooFieldUpdateProcessor inner classses into static inner classes just by passing hte locale into the constructor and right? I did this for the Date and Numeric processors, but the Boolean processor factory has a couple of configured collections and a boolean, so I didn't convert its processor to a static inner class.
          Hide
          Hoss Man added a comment -

          +1 ... code looks good.

          two minor things i didn't notice before...

          a) i think there's some class javadoc cut/paste mistakes ... if you grep the patch for "solr.ParseDoubleFieldUpdateProcessorFactory" it shows up in the class jdocs for several other factories.

          b) the logic in ParseBooleanFieldUpdateProcessorFactory could probably be a little simpler/faster if there was a single Map<String,Boolean> instead of two Set<String> that each had to be checked.

          Show
          Hoss Man added a comment - +1 ... code looks good. two minor things i didn't notice before... a) i think there's some class javadoc cut/paste mistakes ... if you grep the patch for "solr.ParseDoubleFieldUpdateProcessorFactory" it shows up in the class jdocs for several other factories. b) the logic in ParseBooleanFieldUpdateProcessorFactory could probably be a little simpler/faster if there was a single Map<String,Boolean> instead of two Set<String> that each had to be checked.
          Hide
          Steve Rowe added a comment -

          Final patch.

          Beefed up tests and addressed Hoss's javadocs and map vs. two sets issues.

          Committing shortly.

          Show
          Steve Rowe added a comment - Final patch. Beefed up tests and addressed Hoss's javadocs and map vs. two sets issues. Committing shortly.
          Hide
          Steve Rowe added a comment -

          Committed:

          Thanks for the reviews Hoss!

          Show
          Steve Rowe added a comment - Committed: trunk r1497165 branch_4x r1497177 Thanks for the reviews Hoss!
          Hide
          Steve Rowe added a comment -

          Bulk close resolved 4.4 issues

          Show
          Steve Rowe added a comment - Bulk close resolved 4.4 issues

            People

            • Assignee:
              Steve Rowe
              Reporter:
              Steve Rowe
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development