Solr / SOLR-2823

Re-use of UpdateProcessor configurations in multiple UpdateChains

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: update
    • Labels:
      None

      Description

      When dealing with multiple UpdateChains and Processors, you frequently need to re-use configuration. Two chains may be equal except for one config setting in one <processor>.

      I propose to allow named processor configs, which can be referenced by name in the chains.

        Activity

        Jan Høydahl added a comment -

        This could look like:

        <updateRequestProcessorChain name="crawl">
          <processor class="com.example.MyCrawlSpecificProcessor" />
          <processor ref="langid" />
          <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>
        
        <updateRequestProcessorChain name="cms">
          <processor class="com.example.MyCmsSpecificProcessor" />
          <processor ref="langid" />
          <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>
        
        <updateProcessors>
          <processor name="langid" class="solr.LanguageIdentifierUpdateProcessorFactory">
            <str name="langid.fl">text,title,subject,description</str>
            <str name="langid.langField">language_s</str>
            <str name="langid.fallback">en</str>
          </processor>
        </updateProcessors>
        
        Hoss Man added a comment -

        Jan: Seems like it would also make sense to think about common "sub-chains" that are re-used – sequences of processors used in conjunction with one another. Since we already have named processor chains, maybe it would be enough to say that (instead of naming individual processors) you could specify any chain, by name, as a sub-chain...

        <updateRequestProcessorChain name="crawl">
          <processor class="com.example.MyCrawlSpecificProcessor" />
          <subchain ref="common-chain" />
        </updateRequestProcessorChain>
        
        <updateRequestProcessorChain name="cms">
          <processor class="com.example.MyCmsSpecificProcessor" />
          <subchain ref="common-chain" />
        </updateRequestProcessorChain>
        
        <updateRequestProcessorChain name="common-chain">
          <processor name="langid" class="solr.LanguageIdentifierUpdateProcessorFactory">
            <str name="langid.fl">text,title,subject,description</str>
            <str name="langid.langField">language_s</str>
            <str name="langid.fallback">en</str>
          </processor>
          <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>
        

        ...what do you think?

        (probably have to watch out for infinite loops, but that should be a fairly straightforward check when instantiating the chains, and we can fail fast.)
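
        The fail-fast loop check mentioned above could be sketched roughly as follows. This is a minimal illustration in Python rather than Solr code: the function expand_chain, the (kind, value) tuple representation of a chain, and the error messages are all assumptions made for the example, not anything that exists in Solr.

```python
def expand_chain(name, chains, _stack=()):
    """Flatten a chain into a list of processor class names.

    `chains` maps a chain name to a list of (kind, value) tuples, where kind
    is "processor" (value is a class name) or "subchain" (value is the name
    of another chain). This representation is hypothetical.
    """
    if name in _stack:
        # The chain is already being expanded further up: a circular reference.
        cycle = " -> ".join(_stack + (name,))
        raise ValueError(f"circular subchain reference: {cycle}")
    if name not in chains:
        raise ValueError(f"unknown chain {name!r}")

    flat = []
    for kind, value in chains[name]:
        if kind == "subchain":
            # Splice the referenced chain's processors in place of the reference.
            flat.extend(expand_chain(value, chains, _stack + (name,)))
        else:
            flat.append(value)
    return flat


# The three chains from the XML above, in the hypothetical tuple form.
CHAINS = {
    "crawl": [
        ("processor", "com.example.MyCrawlSpecificProcessor"),
        ("subchain", "common-chain"),
    ],
    "cms": [
        ("processor", "com.example.MyCmsSpecificProcessor"),
        ("subchain", "common-chain"),
    ],
    "common-chain": [
        ("processor", "solr.LanguageIdentifierUpdateProcessorFactory"),
        ("processor", "solr.RunUpdateProcessorFactory"),
    ],
}
```

        Expanding every chain once at startup means a circular or dangling reference surfaces as a configuration error when the chains are instantiated, not when the first document is processed.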

        Erik Hatcher added a comment -

        and next thing you know, you'll have recreated Ant's task/datatype/reference/plugin infrastructure

        Chris Male added a comment -

        You make a great point Erik, this sort of thing surely has been considered before. Bean declarations in Spring also come to mind. Are we able to leverage any existing implementations / ideas?

        Jan Høydahl added a comment -

        Sub chains could solve the exact example, but that was just for showing the principle. I think (optionally) named processors are a more direct solution. Think of the named processors as processor configs, not necessarily 1:1 with Java objects. When instantiating the "crawl" chain we'd simply fetch the config from the referenced element instead of inline. It may still have a distinct "solr.LanguageIdentifierUpdateProcessorFactory" class instance from the "cms" pipeline.

        Sub chains may also come in handy for some situations, but that could be handled separately later, if needed.
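
        To make the "config by reference, but distinct instances" idea concrete, here is a rough sketch in Python (not Solr code). It parses the proposed XML and swaps each <processor ref="..."/> for a fresh copy of the named config. The <config> wrapper element and the helper names processor_config and resolve_chains are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The proposed syntax from the first comment, wrapped in a hypothetical
# <config> root so that it parses as a single document.
CONFIG = """
<config>
  <updateRequestProcessorChain name="crawl">
    <processor class="com.example.MyCrawlSpecificProcessor" />
    <processor ref="langid" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
  <updateRequestProcessorChain name="cms">
    <processor class="com.example.MyCmsSpecificProcessor" />
    <processor ref="langid" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
  <updateProcessors>
    <processor name="langid" class="solr.LanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">text,title,subject,description</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.fallback">en</str>
    </processor>
  </updateProcessors>
</config>
"""


def processor_config(element):
    """Build a new dict (class name plus <str> params) from a <processor> element."""
    return {
        "class": element.get("class"),
        "params": {s.get("name"): s.text for s in element.findall("str")},
    }


def resolve_chains(xml_text):
    """Return {chain name: [processor config, ...]} with every ref resolved."""
    root = ET.fromstring(xml_text)
    named = {p.get("name"): p for p in root.findall("./updateProcessors/processor")}

    chains = {}
    for chain in root.findall("updateRequestProcessorChain"):
        resolved = []
        for proc in chain.findall("processor"):
            ref = proc.get("ref")
            if ref is not None:
                if ref not in named:
                    raise ValueError(
                        f"chain {chain.get('name')!r} references unknown processor {ref!r}"
                    )
                proc = named[ref]  # read the config from the referenced element
            resolved.append(processor_config(proc))
        chains[chain.get("name")] = resolved
    return chains


chains = resolve_chains(CONFIG)
```

        Because processor_config builds a new dict on every call, the "crawl" and "cms" chains end up with equal but separate langid configurations, which mirrors the point that a shared config need not mean a shared instance.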

        Erik Hatcher added a comment - - edited

        "Bean declarations in Spring also come to mind. Are we able to leverage any existing implementations / ideas?"

        perish the thought!

        My fairly snarky comment about recreating an "Ant" container for very rich bean setting and executing capabilities was a bit hyperbolic.

        Next someone will want an:

        <if test="${doc.langid} == 'xyz'">
          <next-processor-chain/>
          <else>
            <a-different-processor-chain/>
          </else>
        </if>
        

        So... (and I'm not proposing this since I've not got an implementation to contribute, but bits and pieces of it already are): a ScriptProcessor mechanism, so that when you need logic and code you can, umm, write some code.

        my_update_processor.rb
        ----------------------
        # totally contrived example and syntax
        LogUpdateProcessor docs
        
        case command
          when :add
        
            docs.each do |doc|
              LangIdProcessor doc, {:lang_id => '... options ...'}
              if doc[:lang] == 'fr'
                SpecialFrenchProcessor doc
                doc[:special] = true
              else
                NonFrenchProcessor doc
              end
            end
            next.process_add
          when :delete
            next.process_delete
          else
            next # or raise "Unsupported command"
        end
        
        RunUpdateProcessor docs
        

        Anyway, you can see where I'm coming from with that example (regardless of contrived "DSL").

        In other words, this XML nonsense for configuration is a slippery slope.

        Chris Male added a comment -

        I don't think it is hyperbolic to suggest that we're marching towards workflow definitions, something Ant does very well.

        Given we're already discussing: a) allowing people to write the actual processing logic in JS b) creating a standard set of simple processing functions and c) in this issue wanting to separate function definitions from workflows, an interpreted DSL sounds like a damn good idea.

        Jan Høydahl added a comment -

        Hey guys, you're jumping fast here

        Erik, you must have peeked in my ideas book because exactly what you propose is something I planned to introduce later, but using Groovy as the DSL - much like Gradle does. I think this could be achieved by making UpdateProcessorChains pluggable and definable in solrconfig. The DefaultUpdateProcessorChain could be the simple linear array[] of processors. The ScriptedUpdateProcessorChain would be the powerhouse where you could do both simple linear ones as well as complex logic. You can even do simple document manipulation inline without calling a processor, such as doc.deleteField("title")...

        This approach also solves another wish of mine, namely being able to define chains outside of solrconfig.xml. Logically, configuring schema and document processing is done by a "content" guy, but configuring solrconfig.xml is done by the "hardware/operations" guys. Imagine a solr/conf/pipeline.groovy defined in solrconfig.xml:

        <updateProcessorChain class="solr.ScriptedUpdateProcessorChainFactory" file="pipeline.groovy" />
        

        pipeline.groovy:

        chain simple {
          process(langid)
          process(copyfield)
          chain(logAndRun)
        }
        
        chain moreComplex {
          process(langid)
          if(doc.getFieldValue("employees") > 10)
            process(copyfield)
          else
            chain(myOtherProcesses)
          doc.deleteField("title")
          chain(logAndRun)
        }
        
        chain logAndRun {
          process(log)
          process(run)
        }
        
        processor langid {
          class = "solr.LanguageIdentifierUpdateProcessorFactory"
          config("langid.fl", "title,body")
          config("langid.langField", "language")
          config("map", true)
        }
        
        processor copyfield {
          script = "copyfield.groovy"
          config("from", "title")
          config("to", "title_en")
        }
        

        I don't know what it takes to code such a thing, but if we had it, I'd never go back to defining pipelines in XML
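
        As a rough illustration of the difference between a simple linear chain and a scripted one, the pipeline above can be mimicked with plain functions. This is a sketch in Python, not Groovy and not Solr code: documents are plain dicts, and every processor body here (langid, copyfield, log, run) is a hypothetical stand-in, as is the empty my_other_processes chain, which the original leaves undefined.

```python
indexed = []    # stand-in for the index that the "run" step would write to
log_lines = []  # stand-in for the update log


def langid(doc):
    # Hypothetical stand-in: a real language identifier would inspect the text.
    doc.setdefault("language", "en")


def copyfield(doc):
    # Mirrors config("from", "title") and config("to", "title_en").
    if "title" in doc:
        doc["title_en"] = doc["title"]


def log(doc):
    log_lines.append(f"processing {sorted(doc)}")


def run(doc):
    indexed.append(dict(doc))


def linear_chain(*steps):
    """The 'default' kind of chain: run each step in order, no logic."""
    def chain(doc):
        for step in steps:
            step(doc)
    return chain


log_and_run = linear_chain(log, run)
my_other_processes = linear_chain()  # placeholder: an empty chain
simple = linear_chain(langid, copyfield, log_and_run)


def more_complex(doc):
    """The 'scripted' kind of chain: ordinary code with branching."""
    langid(doc)
    if doc.get("employees", 0) > 10:
        copyfield(doc)
    else:
        my_other_processes(doc)
    doc.pop("title", None)  # inline manipulation, like doc.deleteField("title")
    log_and_run(doc)
```

        Since a chain is just a callable that takes a document, chains and processors compose freely here, which is roughly what chain(logAndRun) expresses in the Groovy sketch.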

        Chris Male added a comment -

        Not that I have anything against Groovy, but can't we achieve something like that using JS? We already have an issue (which Erik will remember the number of) where we support defining the actual processing logic. Can't we work from that?

        Jan Høydahl added a comment -

        The beauty of Groovy in this setting is that it's got all the power of Java right there, everything is objects so you can call any method on any object, you could even do 4.times { processor(incrementor) }

        Language choice is a matter of taste anyway. The DSL should have such defaults that novice users don't even know that they are programming when creating their workflows/chains.

        Hoss Man added a comment -

        "I think (optionally) named processors are a more direct solution. Think of the named processors as processor configs, not necessarily 1:1 with Java objects. When instantiating the "crawl" chain we'd simply fetch the config from the referenced element instead of inline. It may still have a distinct "solr.LanguageIdentifierUpdateProcessorFactory" class instance from the "cms" pipeline."

        To be clear: i wasn't arguing that the subchain syntax i suggested would be 1:1 with java objects either, it could work exactly as you intend named processors to work (ie: pure syntactic sugar). my suggestion was just that if we allow "subchain by reference" type configuration, it would achieve everything you describe using named processors (because you could have a sub-chain containing a single processor) and it would handle the common case of chains that have a lot in common, but do some extra stuff at the beginning/end.

        "Sub chains may also come in handy for some situations, but that could be handled separately later, if needed."

        Eh .. i guess .. but it seems like it would be less total work to just do subchains and let people use chains of one processor to deal with the "reusing individual processor configs" usecase(s).

        (FWIW: Don't let my comments stop/dissuade you, either way would be a huge improvement over what we've got now ... i just wanted to point out something that seemed like more bang for the buck)

        Jan Høydahl added a comment -

        The more I think about the scriptable chain/workflow option, the more I'd like to go that direction and gain full freedom, rather than patch the way of configuring via XML. I've created SOLR-2841 to continue that discussion. This JIRA can then be dedicated to improvements to the current XML config.


          People

          • Assignee: Unassigned
          • Reporter: Jan Høydahl