SOLR-269

UpdateRequestProcessorFactory - process requests before submitting them

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: None
    • Labels: None

      Description

      A simple UpdateRequestProcessor was added to a bloated SOLR-133 commit.

      An UpdateRequestProcessor lets clients plug in logic after a document has been parsed and before it is applied to the index. This is a good place to add custom logic for:

      • transforming the document fields
      • fine-grained authorization (can user X update document Y?)
      • allow update, but not delete (by query?)

      <requestHandler name="/update" class="solr.StaxUpdateRequestHandler" >
      <str name="update.processor.class">org.apache.solr.handler.UpdateRequestProcessor</str>
      <lst name="update.processor.args">
      ... (optionally pass in arguments to the factory init method) ...
      </lst>
      </requestHandler>
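The authorization use case above can be sketched with a small, self-contained example. The types below are simplified, hypothetical stand-ins for the Solr interfaces (the real processor works with UpdateCommand objects and the request context), not the actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Solr's update handler; the real interface
// takes DeleteUpdateCommand objects rather than bare ids.
interface UpdateHandler {
    void delete(String id);
}

// A processor that vetoes deletes for non-admin users -- the kind of
// fine-grained authorization hook the description suggests.
class AuthorizingUpdateProcessor {
    private final UpdateHandler updateHandler;
    private final boolean isAdmin;

    AuthorizingUpdateProcessor(UpdateHandler updateHandler, boolean isAdmin) {
        this.updateHandler = updateHandler;
        this.isAdmin = isAdmin;
    }

    void processDelete(String id) {
        if (isAdmin) {
            updateHandler.delete(id);   // allowed: forward to the index
        } else {
            throw new SecurityException("user may not delete " + id);
        }
    }
}

public class ProcessorDemo {
    static List<String> run() {
        List<String> deleted = new ArrayList<>();
        // An admin user's delete passes through to the handler.
        new AuthorizingUpdateProcessor(deleted::add, true).processDelete("doc1");
        return deleted;
    }

    public static void main(String[] args) {
        System.out.println(run());  // [doc1]
    }
}
```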

      http://www.nabble.com/Re%3A-svn-commit%3A-r547495---in--lucene-solr-trunk%3A-example-solr-conf-solrconfig.xml-src-java-org-apache-solr-handler-StaxUpdateRequestHandler.java-src-java-org-apache-solr-handler-UpdateRequestProcessor.jav-tf3950072.html#a11206583

      1. SOLR-269-simple.patch
        7 kB
        Noble Paul
      2. SOLR-269-UpdateRequestProcessorFactory.patch
        54 kB
        Ryan McKinley
      3. SOLR-269-UpdateRequestProcessorFactory.patch
        45 kB
        Ryan McKinley
      4. UpdateProcessor.patch
        18 kB
        Yonik Seeley
      5. SOLR-269-UpdateRequestProcessorFactory.patch
        6 kB
        Ryan McKinley

        Issue Links

          Activity

          Yonik Seeley added a comment -

          Looking at UpdateRequestProcessor further, it seems like these should be singletons (instance per entry in solrconfig, no factory needed), and any extra state that is needed
          should be added to classes we already have (like AddCommand, etc), no?

          Yonik Seeley added a comment -

          I think the newly added incremental time should not be on by default, as well as logging per id for deletes and adds.
          Mike added the id aggregation code specifically because logging each add was taking so much time.

          Ryan McKinley added a comment -

          maybe. I'm not sure I totally understand your suggestion though.

          I need something that is easily subclassed and can cleanly hold state across an entire request cycle. The alternative is to pass the SolrQueryRequest/Response into each action and pull out the schema/updateHandler/logged-in user/etc. for each command (each document in the list of 100).

          Is the factory a performance concern? To my taste, it seems nicer to work with:

          processDelete( DeleteUpdateCommand cmd )
          {
            if( user.isAdmin() ) {
              updateHandler.delete( cmd );
            }
            else {
              ...
            }
          }

          than:

          processDelete( DeleteUpdateCommand cmd, SolrQueryRequest req, SolrQueryResponse rsp )
          {
            User user = req.getContext().get( "user" );
            if( user.isAdmin() ) {
              SolrCore core = req.getCore();
              SolrSchema schema = core.getSchema();
              UpdateHandler updateHandler = core.getUpdateHandler();
              updateHandler.delete( cmd );
            }
            else {
              ...
            }
          }

          I'm fine either way, but I like the easy one-per-request interface.

          Ryan McKinley added a comment -

          > I think the newly added incremental time should not be on by default, as well as logging per id for deletes and adds.
          > Mike added the id aggregation code specifically because logging each add was taking so much time.

          sounds good. the testing I did showed that lots of time is spent in the logging phase.

          I will remove it from the default implementation.

          Yonik Seeley added a comment -

          > I need something that is easily subclassed and can cleanly hold state across an entire request cycle.

          Having a factory and separate object so that one can use core instead of req.getCore(), etc, seems like overkill for the normal case though since
          getCore(), getSchema(), getUpdateHandler() all just return instance variables. I was thinking any state like that could be on the UpdateCommand.

          I'd like to have potentially several request processors, but if people start doing single doc add requests, instantiating and initializing all those request processors will get expensive.

          I do see your usecase though, in the case of multiple docs per add and you have some expensive state you only want to calculate once.
          If it's a relatively rare case, one could put it in the request context.
          The tradeoff would be an extra hash lookup per-document of a multi-document add vs an extra object creation for single-doc adds.

          Different Q on usage: is this where my document mutator stuff should go??? If I want a transformation done on a field, regardless of where the data is coming from (XML update handler, CSV update handler, future REST update handler, etc), how should that be done? Is there a single place I can register a plugin to do this, and is UpdateRequestProcessor where you see it happening?

          Hoss Man added a comment -

          > Different Q on usage: is this where my document mutator stuff should go??? If I want a transformation done on a field, regardless of
          > where the data is coming from (XML update handler, CSV update handler, future REST update handler, etc), how should that be done?
          > Is there a single place I can register a plugin to do this, and is UpdateRequestProcessor where you see it happening?

          i believe that was actually the initial intent of UpdateRequestProcessor; note the javadocs...

            This is a good place for subclassed update handlers to process the document before it is
            indexed. You may wish to add/remove fields or check if the requested user is allowed to
            update the given document...
            Perhaps you continue adding an error message (without indexing the document)...
            perhaps you throw an error and halt indexing (remove anything already indexed??)
          Ryan McKinley added a comment -

          > getCore(), getSchema(), getUpdateHandler() all just return instance variables.
          >

          ok, getCore() isn't a good candidate; it is annoying to start every function with a train wreck: req.getCore().getUpdateHandler()

          Having a single class per request makes sense for a subclass I am working with – it does some expensive initialization and stores the results. I could put this in req.getContext()

          > instantiating and initializing all those request processors will get expensive.

          Really? The default initialization is trivial – stuff that would happen at the beginning of every function anyway. I suppose GC could be an issue.

          > I do see your usecase though, in the case of multiple docs per add and you have some expensive state you only want to calculate once.

          In r552986, I changed the logging to match Solr 1.2 – this required accumulating the ids and spitting them out at the end. In 1.2, with processing and parsing entwined, this was just a giant loop. To get the same behavior we need to stash it somewhere...

          > Different Q on usage: is this where my document mutator stuff should go???

          Yes. The intent is to have a simple place between document parsing and indexing where you can do whatever you need to do. Any parsing strategy (XML,JSON,etc) could share the same processor.

          Looking at SOLR-139, I now think the most flexible/useful way to support modifiable documents is to build utility functions for the UpdateProcessor that can manipulate SolrInputDocuments.

          - - -

          I will take another crack at SOLR-139 implemented in the UpdateProcessor, then we should return to the question of singleton vs factory - trying to work with a more complex processor may make this choice more obvious.

          Hoss Man added a comment -

          > we should return to the question of singleton vs factory - trying to work with a more complex processor may make this choice more obvious.

          i'm not really sure if i understand the issue ... but if it's a question of performance in the default case then i don't really see an issue – a Factory API can return a singleton provided the impl is threadsafe (giving us all the performance goodness of a singleton), but switching to a Singleton API really limits what people can do when they want a complex UpdateRequestProcessor that might take a while to set up.

          Yonik, would your concerns be relieved if the default UpdateRequestProcessorFactory class was changed to look like this...

          public class UpdateRequestProcessorFactory {
            private UpdateRequestProcessor singleton;

            public UpdateRequestProcessorFactory() { /* NOOP */ }

            public void init( NamedList<Object> args ) {
              singleton = new UpdateRequestProcessor( args );
            }

            public UpdateRequestProcessor getInstance( SolrQueryRequest req ) {
              return singleton;
            }
          }

          ?

          Yonik Seeley added a comment -

          FYI, I'm working up a prototype right now to handle multiple request processors.

          Hoss Man added a comment -

          couldn't that just be a DelegateUpdateRequestProcessor that is constructed using a list of other UpdateRequestProcessors?

          Yonik Seeley added a comment -

          OK, this patch adds the ability to specify multiple processor factories.

          Summary:

          • decoupled changing the index from logging... I think it makes it much clearer how things work (there was much more logging code than anything else). This also would allow Ryan to add back his incremental timing to a different processor.
          • added SolrInputDocument to AddUpdateCommand, and added some methods
          • removed NamedList return from the XML update handler and started passing SolrQueryResponse around instead (this is more future-proof and flexible)
          • removed adding all the ids to the response... (back to 1.2 response format). We probably shouldn't add ids by default... think of CSV uploading millions of records, etc.
          • An array of factories is kept, and when a processor is instantiated, it is passed a "next" pointer. An alternative would be to expose a "next factory" pointer to every factory (any advantage to having the current factory call getInstance() on the next factory instead of us doing that?)
          • untested support for basic syntax to support multiple update processors:
            <lst name="update.processor">
              <str name="factory">org.apache.solr....</str>
              <lst name="args">...</lst>
            </lst>
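The "next pointer" chaining described in the summary can be sketched as a chain of responsibility. The class names below are illustrative, not the committed Solr classes:

```java
// Hypothetical sketch of processors linked by a "next" pointer:
// each processor does its work, then delegates down the chain.
abstract class ChainedProcessor {
    protected final ChainedProcessor next;

    ChainedProcessor(ChainedProcessor next) { this.next = next; }

    void processAdd(StringBuilder trace) {
        if (next != null) next.processAdd(trace);
    }
}

// Logging is decoupled from index changes, as in the patch summary.
class LogProcessor extends ChainedProcessor {
    LogProcessor(ChainedProcessor next) { super(next); }
    @Override void processAdd(StringBuilder trace) {
        trace.append("log;");
        super.processAdd(trace);  // hand off to the next processor
    }
}

class IndexProcessor extends ChainedProcessor {
    IndexProcessor(ChainedProcessor next) { super(next); }
    @Override void processAdd(StringBuilder trace) {
        trace.append("index;");
        super.processAdd(trace);
    }
}

public class ChainDemo {
    static String run() {
        // Factories would build this per request: log first, then index.
        ChainedProcessor chain = new LogProcessor(new IndexProcessor(null));
        StringBuilder trace = new StringBuilder();
        chain.processAdd(trace);
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());  // log;index;
    }
}
```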
          Yonik Seeley added a comment -

          > couldn't that just be a DelegateUpdateRequestProcessor that is constructed using a list of other UpdateRequestProcessors

          That might be the way to go if multiple processors were to be very rare... but then you need to come
          up with a syntax for the args of DelegateUpdateRequestProcessor to specify multiple delegates, and one ends up writing the same code with more complex configuration.

          Ryan McKinley added a comment -

          I like the AddUpdateCommand changes

          What do you see as the common use case for wanting to chain request processors? Is the LogUpdateRequestProcessor just an example?

          The one compelling chained use case I can think of is for document transformation. In SOLR-139, I toyed with SolrInputDocumentTransformer. The default case does nothing, and a subclass may use something like:
          for( SolrInputDocumentTransformer t : transformers ) {
            doc = t.transform( doc );
          }
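That transform loop can be made concrete with a minimal sketch. Here a SolrInputDocument is reduced to a plain field map, and the transformer interface is a hypothetical stand-in for the SolrInputDocumentTransformer toyed with in SOLR-139:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: a SolrInputDocument reduced to a field map.
interface DocTransformer {
    Map<String, String> transform(Map<String, String> doc);
}

public class TransformDemo {
    // The loop from the comment above: each transformer in turn
    // may modify or replace the document.
    static Map<String, String> applyAll(Map<String, String> doc,
                                        List<DocTransformer> transformers) {
        for (DocTransformer t : transformers) {
            doc = t.transform(doc);
        }
        return doc;
    }

    static Map<String, String> run() {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "solr");
        return applyAll(doc, List.of(
            // a field transformation...
            d -> { d.put("title", d.get("title").toUpperCase()); return d; },
            // ...followed by a copyField-style step
            d -> { d.put("title_copy", d.get("title")); return d; }
        ));
    }

    public static void main(String[] args) {
        System.out.println(run().get("title_copy"));  // SOLR
    }
}
```

Because transformers run in order, the copy step sees the already-uppercased title, which is exactly the ordering question a chain makes explicit.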
          Yonik Seeley added a comment -

          > What do you see as the common use case for wanting to chain request processors?

          • conditional copyField, field transformations (between multiple fields too... something Analyzer can't do), loading certain fields from a database if missing,
            updating a related document, etc.

          > Is the LogUpdateRequestProcessor just an example?

          IMO, it's a default since no logging is done by the ChangeUpdateRequestProcessor (anyone think of a better name for that?).
          Then in a Benchmarking section of the Solr Wiki, we could advise to remove logging altogether. Or you could remove the ChangeUpdateRequestProcessor to skip index
          changes to better benchmark hotspots in the parsing + doc creation phase, etc.

          > The one compelling chained use case I can think of is for document transformation

          Ah, I briefly looked at SOLR-139 when you mentioned it before, but missed the transformer stuff.
          In a way, multiple update processors are more generic and wide open... you could actually insert two documents into the index for each doc added,
          or do transforms on the actual Lucene document (add Field options that Solr doesn't currently support, etc.).

          Yonik Seeley added a comment -

          Some other issues.... how to configure processors for multiple update handlers? Perhaps allow configuration of a global default for update handlers with no processors specified? That would make it easy to make sure your custom processor was used everywhere.
          We should probably have a base class for update handlers to implement initialization logic.

          Ryan McKinley added a comment -

          > - conditional copyField, field transformations (between multiple fields too... something Analyzer can't do), loading certain fields from a database if missing,
          > updating a related document, etc.
          >

          This all works nicely with a simple 'transform' chain.

          >> Is the LogUpdateRequestProcessor just an example?
          >
          > IMO, it's a default since no logging is done by the ChangeUpdateRequestProcessor (anyone think of a better name for that?).
          > Then in a Benchmarking section of the Solr Wiki, we could advise to remove logging altogether. Or you could remove the ChangeUpdateRequestProcessor to skip index
          > changes to better benchmark hotspots in the parsing + doc creation phase, etc.
          >

          Isn't logging best configured with standard java.util.logging settings? If necessary, the base processor could check whether the logging level is high enough to keep track of some things.

          For benchmarking, don't we just want a single noop processor?
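The level check suggested above is a one-liner with java.util.logging; the logger name and helper below are illustrative, not actual Solr code:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogGuardDemo {
    // Hypothetical logger name for illustration only.
    private static final Logger log = Logger.getLogger("solr.update.demo");

    // Skip the per-id bookkeeping entirely unless the configured
    // level actually asks for it -- the default path pays nothing.
    static boolean shouldTrackIds() {
        return log.isLoggable(Level.FINE);
    }

    public static void main(String[] args) {
        log.setLevel(Level.INFO);
        System.out.println(shouldTrackIds());  // false
        log.setLevel(Level.FINE);
        System.out.println(shouldTrackIds());  // true
    }
}
```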

          >> The one compelling chained use case I can think of is for document transformation
          >
          > Ah, I briefly looked at SOLR-139 when you mentioned it before, but missed the transformer stuff.
          > In a way multiple update processors are more generic and wide open... you could actually insert two documents into the index for each doc added,
          > you could do transforms on the actual Lucene document (add Field options that Solr doesn't currently support, etc.
          >

          I see what you are getting at, but it makes the basic cases more complicated than they need to be. I have been considering UpdateRequestProcessor an 'advanced' option where changing its behavior means writing custom code – not text configuration.

          In the advanced case where you want to build multiple documents or munge the actual Lucene document, it may be more difficult to live in a chain rather than have explicit control.

          I think the cleanest design would be a single entry point, keeping the real functionality in easily subclassed functions or utility classes. The latest SOLR-139 tries that (but it could still use some cleanup).

          Ryan McKinley added a comment -

          > Some other issues.... how to configure processors for multiple update handlers? Perhaps allow configuration of a global default for update handlers with no processors specified? That would make it easy to make sure your custom processor was used everywhere.

          SolrCore could have a single UpdateRequestProcessorFactory that handlers could use as the default. I'm reluctant to add another plugin layer, but this would make it easier to share with the CSV update handler and others.

          Since it's a factory, it will be thread safe across multiple handlers.

          Again, I'm reluctant to think about configuring a processor chain in solrconfig.xml – we should make the most sensible/extensible default implementation, but IMO tweaking RequestProcessor functionality should be done with custom code.

          Yonik Seeley added a comment -

          > This all works nicely with a simple 'transform' chain.

          I don't see any code allowing initialization of the transform chain, or anything like that.
          What are you proposing?

          I think it gets tougher when one talks about updates rather than document adds too.

          > Isn't logging best configured with standard java.util.logging settings?

          Not if you want a different type of logging... you seemed to want timing per document added, for example.
          Having log-specific configuration (such as the number of ids to log in a big add) in its own processor seems slightly nicer too.
          Is there a downside to the logging being separated out in this case? I really don't have strong feelings about it though (as long as we can keep the default version lean enough).

          > I see what you are getting at, but makes the basic cases more complicated than it needs to be.

          Yes, with the upside that we know our transform interface isn't too limiting.

          > I'm reluctant to add another plugin layer

          Me too... which is why just doing transform with the processors seems desirable (one less type of plugin).
          If transformations are not to be done with UpdateRequestProcessor, I'm not sure we should expose that interface at all as it's tightly coupled with DUH2.
          It seems we really only need one type of plugin to do these document transformations.

          > SolrCore could have a single UpdateRequestProcessorFactory that handlers could use as the default.

          Yes, with update processors, the needed interface to add a document changes... you need the processor rather than the update handler.

          Yonik Seeley added a comment -

          > with update processors, the needed interface to add a document changes... you need the processor rather than the update handler.

          Thinking on these lines a bit further... perhaps the extra functionality of transformations and updating should be pushed into the UpdateHandler interface
          (DUH2). If it makes sense, we could deprecate the existing AddUpdateCommand & methods in favor of new ones that use SolrInputDocument.

          Ryan McKinley added a comment -

          > perhaps the extra functionality of transformations and updating should be pushed into the UpdateHandler interface

          That was the first SOLR-139 design!

          Having thought about it for a while, I think there are nice advantages to keeping the updating/modifying outside of the UpdateHandler - the biggest one is that various RequestHandlers could transform the document differently.

          I'm putting together a hybrid example that (I hope) answers questions about chains/configuration/transformation, etc. I'll post it shortly.

          Ryan McKinley added a comment -
          • loads UpdateRequestProcessorFactories using the plugin loader stuff from SOLR-260.
          • makes an UpdateRequestHandlerBase class that XML and CSV share.
          • loads and configures a chain of InputTransformations.
          • moves UpdateRequestProcessor from o.a.s.handler to o.a.s.update

          This is the test config:

          <updateRequestProcessor>
          <factory name="standard" class="solr.UpdateRequestProcessorFactory" />
          <factory name="custom" class="solr.CustomUpdateRequestProcessorFactory" default="true">
          <args>
          <lst name="name">
          <str name="n1">v1</str>
          <str name="n2">v2</str>
          </lst>
          </args>
          <transformer class="solr.CustomTransformerNoOp" />
          <transformer class="solr.CustomTransformerNoOp" />
          <transformer class="solr.CustomTransformerAdd">
          <str name="f0">000</str>
          <str name="f1">111</str>
          </transformer>
          <transformer class="solr.CustomTransformerNoOp" />
          </factory>
          </updateRequestProcessor>

          I am not sure we want to make the transformer interface public yet, but it is here to show how I think it could be handled.

          If you like this approach, I'll clean it up some more...

          Ryan McKinley added a comment -

          > I don't see any code allowing initialization of the transform chain, or anything like that.
          > What are you proposing?

          Check the latest patch. I'm not sure we need to have XML configuration for this, but I added it as an example. The factory would hold a list of transformers:
          <transformer class="MyConditionalCopyField" />
          <transformer class="SynonymCleaner" />
          <transformer class="RunAnalyzerAndStoreAsMultiValuedFields" />

          > Is there a downside to the logging being separated out in this case?

          only that it justifies the UpdateRequestProcessorFactory chain

          - - - -

          This patch includes a UpdateRequestProcessorChainFactory - this is a UpdateRequestProcessorFactory that keeps a chain of UpdateRequestProcessorFactories and iterates through them. This pointed out another thing I don't like about the chain pattern.

          I need a custom UpdateRequestProcessor that checks all the requests before executing any of them. I plan to store the valid commands in a list and only execute them in the finish() call. I'm not sure how to map that plan to a chain. How would I pass the output from one processor to the next?

          Yonik Seeley added a comment -

          > I'm not sure we need to have XML configuration for this

          If we have those multiple update processor factories, I agree we don't need XML config for the transformers.

          > I need a custom UpdateRequestProcessor that checks all the requests before executing any of them. I plan to store the valid commands in a list and only execute them in the finish() call. I'm not sure how to map that plan to a chain. How would I pass the output from one processor to the next?

          I had thought of that use-case too (bulk operations), which is why I added explicit flow control (explicit calling of next.handleAdd() in the processor).
          You can buffer up all the requests (you want to clone the UpdateCommands as they might be reused though) and not call "next".
          Then in finish, you can delegate all of the buffered commands.
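          The buffer-then-delegate pattern described here can be sketched with simplified stand-in classes. The real Solr types (UpdateRequestProcessor, AddUpdateCommand) are reduced to a plain String-carrying chain link, and names like BufferingProcessor and CollectingProcessor are hypothetical; only the chaining shape matches the discussion:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a chain link; not the actual Solr API.
class Processor {
    protected final Processor next;
    Processor(Processor next) { this.next = next; }
    void processAdd(String doc) { if (next != null) next.processAdd(doc); }
    void finish() { if (next != null) next.finish(); }
}

// Records what reaches the end of the chain (stands in for the index update).
class CollectingProcessor extends Processor {
    final List<String> seen = new ArrayList<>();
    CollectingProcessor(Processor next) { super(next); }
    @Override void processAdd(String doc) { seen.add(doc); super.processAdd(doc); }
}

// Buffers every command instead of calling next, then validates and
// delegates the whole batch in finish() -- the bulk-operation use case.
class BufferingProcessor extends Processor {
    private final List<String> buffered = new ArrayList<>();
    BufferingProcessor(Processor next) { super(next); }
    @Override void processAdd(String doc) {
        buffered.add(doc);                 // deliberately do NOT call next here
    }
    @Override void finish() {
        for (String doc : buffered) {
            if (!doc.isEmpty()) {          // stand-in validity check
                super.processAdd(doc);     // delegate only the valid commands
            }
        }
        super.finish();
    }
}
```

          Nothing reaches the tail of the chain until finish() runs, at which point only the commands that pass the check are forwarded.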

          Yonik Seeley added a comment -

          What about working off the version I attached (or at least folding in those changes)? It had a bunch of changes that I prefer, including

          • explicit flow control between processors for greatest flexibility
          • removal of NamedList return (as you say, chaining those makes less sense anyway)
          • already extracted and optimized the complex (or rather bigger) logging logic from the simple index updating
          • passed in SolrQueryResponse as well, enabling a processor to change the response
          Ryan McKinley added a comment -

          The only one I'm not sure about is:

          • explicit flow control between processors for greatest flexibility

          I'm still trying to avoid the parent UpdateRequestProcessorFactory chain as a default behavior. It seems fine as a super-duper custom controller, but unruly in the default/slightly custom case.

          Folding in:

          • removal of NamedList return (as you say, chaining those makes less sense anyway)
          • already extracted and optimized the complex (or rather bigger) logging logic from the simple index updating
          • passed in SolrQueryResponse as well, enabling a processor to change the response
            is no problem.

          If you like the general structure / flow of SOLR-269-UpdateRequestProcessorFactory.patch, I'll clean it up and work in this stuff. Otherwise I'll look at how to make UpdateRequestProcessorFactory[] feel more palatable.

          Yonik Seeley added a comment -

          > The only one I'm not sure about is:
          > - explicit flow control between processors for greatest flexibility

          It's a single call per hook:
          if (next != null) next.processAdd();

          And it's exactly what you need for your "buffering" situation.
          Chaining is the model that Lucene uses for its analyzers too (the only difference is that it's a pull instead of a push).

          > I'm still trying to avoid the parent UpdateRequestProcessorFactory chain as a default behavior. It seems fine as a super-duper custom controller, but unruly in the default/slightly custom case.

          I'm not clear on why... the configuration is more complex?

          > If you like the general structure / flow of SOLR-269-UpdateRequestProcessorFactory.patch

          I'm not sure about the named processors... are they needed?
          It seems like we need a "standard" one that is used by default everywhere,
          and then maybe we need to be able to change them per-handler. Do we need this up front, or could it be deferred?

          It seems like there does need to be a method on SolrCore to get a RequestProcessor or Factory, since that becomes
          the new interface to do an index change (otherwise you miss the doc transformations, etc).

          > Otherwise I'll look at how to make UpdateRequestProcessorFactory[] feel more palatable.

          That could be wrapped in another UpdateRequestProcessorFactory if desired... it doesn't matter much if the impl is hidden by a class or a method IMO.

          Ryan McKinley added a comment -

          > It's a single call per hook:
          > if (next != null) next.processAdd();
          >

          Ok. I'm convinced.

          >
          > I'm not sure about the named processors... are they needed?
          > It seems like we need a "standard" one that is used by default everywhere,
          > and then maybe we need to be able to change them per-handler. Do we need this up front, or could it be deferred?

          I'm not sure. The only reason I think we may want to do it now is to keep the initialization standard and in a single place. If we declare a default processor and have each handler optionally initialize their own, the config may look different. RequestHandlers only have access to a NamedList while being initialized; they can't (without serious changes) declare something like:
          <requestHandler ...>
          <updateProcessor class="" />
          </requestHandler>

          With that in mind, I think it best to build the updateProcessors using the standard PluginLoader framework and then have RequestHandlers access them by name.

          >
          >> Otherwise I'll look at how to make UpdateRequestProcessorFactory[] feel more palatable.
          >
          > That could be wrapped in another UpdateRequestProcessorFactory if desired... it doesn't matter much if the impl is hidden by a class or a method IMO.

          Ok, I'll start with UpdateProcessor.patch and fold in my changes.

          Ryan McKinley added a comment -

          New version that started with Yonik's patch. This masks the 'Chain' in a ChainedUpdateProcessorFactory. By default it has two elements:
          1. RunUpdateProcessor
          2. LogUpdateProcessor

          I moved the processor stuff to o.a.s.update.processor. The classes need to be public for them to get created with the PluginLoader.

          This still has a map<String,Factory> in core.

          Here is the XML configuration I am using in the testing:

          <updateRequestProcessor>
          <factory name="standard" class="solr.ChainedUpdateProcessorFactory" >
          <chain class="solr.LogUpdateProcessorFactory" >
          <int name="maxNumToLog">100</int>
          </chain>
          <chain class="solr.CustomUpdateRequestProcessorFactory" >
          <lst name="name">
          <str name="n1">x1</str>
          <str name="n2">x2</str>
          </lst>
          </chain>
          <chain class="solr.CustomUpdateRequestProcessorFactory" >
          <lst name="name">
          <str name="nA">xA</str>
          <str name="nA">xA</str>
          </lst>
          </chain>
          </factory>

          <factory name="custom" class="solr.CustomUpdateRequestProcessorFactory" default="true" >
          <lst name="name">
          <str name="n8">88</str>
          <str name="n9">99</str>
          </lst>
          </factory>
          </updateRequestProcessor>

          I'm starting to like the structure.

          thoughts?

          Yonik Seeley added a comment -

          One issue: the current way of having a custom processor (CustomUpdateRequestHandler) seems less than ideal.
          First is that CustomUpdateRequestHandler extends XMLUpdateRequestHandler.... but what if I want one for CSV, etc.
          If update processors are to be a first-class part of Solr, it seems like one should be able to specify the processor to use
          for any update handler (CSV, XML, etc) without having to write extra classes for those.

          Perhaps something like:
          <requestHandler name="/customupdate" class="solr.XmlUpdateRequestHandler" >
          <str name="update.processor">standard</str>
          </requestHandler>

          Ryan McKinley added a comment -

          Hmm. The behavior on trunk is:
          <requestHandler name="/customupdate" class="solr.XmlUpdateRequestHandler" >
          <str name="update.processor.factory">class name for factory</str>
          </requestHandler>
          The latest patch has the argument look up an XML-configured factory.

          Do you mean:

          <requestHandler name="/customupdate" class="solr.XmlUpdateRequestHandler" >
          <lst name="invariants">
          <str name="update.processor">standard</str>
          </lst>
          </requestHandler>

          Given the direction we are heading, it seems nice to be able to change the update behavior from:
          /update?update.processor=do-fancy-document-cleanup
          /update?update.processor=go-quick-i-know-the-docs-are-clean

          I made it a 1-1 relation (processor-handler) to avoid a hash lookup for each request, but from a param would be OK.

          Yonik Seeley added a comment -

          > I made it a 1-1 relation (processor-handler) to avoid a hash lookup for each request,

          That was my thinking too... I wasn't suggesting making it an overrideable parameter, but I'm not really against it either.

          Hoss Man added a comment -

          the idea of letting people override the processor on a per-request basis seems very scary and, depending on what kinds of stuff you were expecting the processor to do, could introduce some serious bugs ... but then again, if it's a param, it can be specified as an invariant if you want to ensure that doesn't happen.

          I guess the key thing is just that the RequestHandler has final say over what Processor gets used ... we can provide handy tools/conventions to get that info from the config or the request, but a very simplistic RequestHandler should be able to hardcode it for absolute control.

          Ryan McKinley added a comment -

          > the RequestHandler has final say over what Processor gets used

          absolutely! The question is just what to do in the default /update case. I'm inclined to have the request say what processor to use. With 'invariants' that can be fixed to a single implementation, and it will let people configure processors without a custom handler.

          How do you all feel about the basic structure? I like the structure, but am not sure how 'public' to make the configuration and implementation. It would be nice to keep the base stuff package-protected, but then we can't have external configuration, and external classes could not reuse the other bits of the chain (defeating the 'chain' advantages).

          I have a pending deadline that depends on input processing and SOLR-139 modifiable documents – it would be great to work from a lightly patched trunk rather than a heavily patched one.

          Ryan McKinley added a comment -

          The latest patch on SOLR-139 includes a cleaned-up version of SOLR-269. One clever change is to have the LogUpdateProcessorFactory skip building a LogUpdateProcessor if the log level is not INFO rather than keeping a flag.
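          The trick can be sketched as follows. These are illustrative stand-in classes, not the actual Solr factory API; the point is that the logging decision is made once, when the chain is built, rather than per document:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Simplified stand-in for a chain link; not the actual Solr API.
class Proc {
    final Proc next;
    Proc(Proc next) { this.next = next; }
    void processAdd(String doc) { if (next != null) next.processAdd(doc); }
}

// The logging link does per-document work...
class LoggingProc extends Proc {
    static final Logger log = Logger.getLogger("solr.update.sketch");
    LoggingProc(Proc next) { super(next); }
    @Override void processAdd(String doc) {
        log.info("add " + doc);
        super.processAdd(doc);
    }
}

// ...so the factory simply leaves it out of the chain when INFO is
// disabled, instead of building it and checking a flag on every call.
class LogProcFactory {
    static Proc getInstance(Proc next) {
        return LoggingProc.log.isLoggable(Level.INFO) ? new LoggingProc(next) : next;
    }
}
```

          When logging is off, the chain pays nothing for it: the logging link simply does not exist.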

          Yonik Seeley added a comment -

          > How do you all feel about the basic structure?
          It's a go!
          It will get more complicated, I think, with document modification (SOLR-139)

          > While it would be nice to keep the base stuff package protected,

          I'm more concerned with the other parts of the API that this moves front-and-center...
          mainly UpdateCommand and friends... those were really quick hacks on my part since there were no custom "update" handlers at the time.

          > One clever change is to have the LogUpdateProcessorFactory skip building a LogUpdateProcessor if the log level is not INFO rather then keep a flag.

          Nice!

          I also need SOLR-139 btw, is it easy for you to commit this first to limit the size and scope of that patch?

          Show
          Yonik Seeley added a comment - > How do you all feel about the basic structure? It's a go! It will get more complicated, I think, with document modification ( SOLR-139 ) > While it would be nice to keep the base stuff package protected, I'm more concerned with the other parts of the API that this moves front-and-center... mainly UpdateCommand and friends... those were really quick hacks on my part since there were no custom "update" handlers at the time. > One clever change is to have the LogUpdateProcessorFactory skip building a LogUpdateProcessor if the log level is not INFO rather then keep a flag. Nice! I also need SOLR-139 btw, is it easy for you to commit this first to limit the size and scope of that patch?
          Shalin Shekhar Mangar added a comment -

          I've read through most of the discussion here and the wiki page at http://wiki.apache.org/solr/UpdateRequestProcessor but I couldn't understand the reasons behind the current design.

          Looking at the configuration we have:

          <updateRequestProcessor>                                               
             <factory name="standard" class="solr.ChainedUpdateProcessorFactory" default="true" >
               <chain class="org.apache.solr.ConditionalCopyProcessorFactory" />                                    
               <chain class="solr.RunUpdateProcessorFactory" />                    
               <chain class="solr.LogUpdateProcessorFactory" />                   
             </factory>                                                           
           </updateRequestProcessor>
          

          Why can't it be written as:

          <updateRequestProcessor name="standard" default="true">
            <processor class="com.MyUpdateProcessor" />
            <processor class="solr.RunUpdateProcessor" />
          </updateRequestProcessor>
          
          <!-- Another one -->
          <updateRequestProcessor name="alternate">
            <processor class="org.apache.solr.ConditionalCopyProcessor" />
            <processor class="solr.RunUpdateProcessor" />
            <processor class="solr.LogUpdateProcessor" />
          </updateRequestProcessor>
          

          Why do we need factories here? It seems like there is no advantage being added by multiple factories. If the only advantage is with the factory being able to choose between instantiating on each request or using an already instantiated processor, then one can argue for having factories for RequestHandlers or SearchComponents too. The Processors should be created once and re-used. Most of them are stateless and the others can use the init method and store state in instance variables. The same is done with RequestHandlers and SearchComponents at present.

          Why should we have an explicit ChainedUpdateRequestProcessorFactory? It seems from the use-cases that processors will always be chained. Let us have the implementation do the chaining instead of asking users to add a factory in the configuration.

          Not trying to be critical, but it seems like this is too complex for the use-cases it needs to support.

          Ryan McKinley added a comment -

          Not trying to be critical, but it seems like this is too complex for the use-cases it needs to support.

          Nonsense – the more review / feedback / critique we get, the better – especially before a release

          "Why do we need factories here?" – the model came from how things work with Token/Filter factories. Many processors need to maintain state within a request. Check the 'log' processor. I have one that checks if the user has permission on everything in the request before executing the commands. We could have something that keeps track of what it did and backs out the changes if there is an error. If each processor were shared across all requests, any state access would need to be synchronized and have some Map<Request,State> that seems to get ugly pretty fast.

          Why ChainedUpdateRequestProcessorFactory? I see your point here. I think we can force everything to be 'chained' – the original implementation was not chained, but then the functional parts got split into their own components and chained together. Removing the parent chained factory could simplify the whole thing.
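The factory-per-request model Ryan is defending looks roughly like this. A minimal sketch with hypothetical simplified types (not Solr's actual API): each request asks every factory for a fresh processor instance, so request-scoped state lives in plain instance fields with no synchronization and no Map<Request,State>.

```java
import java.util.List;

// Simplified stand-ins for the types under discussion.
abstract class Processor {
    protected final Processor next;

    Processor(Processor next) {
        this.next = next;
    }

    void processAdd(String doc) {
        if (next != null) next.processAdd(doc);
    }
}

interface ProcessorFactory {
    // Called once per request, so each request gets its own instances.
    Processor getInstance(Processor next);
}

// Request-scoped state is just a field: no locking, no context lookups.
class CountingProcessor extends Processor {
    int seen = 0;

    CountingProcessor(Processor next) {
        super(next);
    }

    @Override
    void processAdd(String doc) {
        seen++;
        super.processAdd(doc);
    }
}

class Chain {
    private final List<ProcessorFactory> factories;

    Chain(List<ProcessorFactory> factories) {
        this.factories = factories;
    }

    // Link the processors back-to-front, building a fresh chain per request.
    Processor newProcessor() {
        Processor p = null;
        for (int i = factories.size() - 1; i >= 0; i--) {
            p = factories.get(i).getInstance(p);
        }
        return p;
    }
}
```

Two concurrent requests each get their own CountingProcessor, so their counts never interfere.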

          Noble Paul added a comment (edited) -

          Many processors need to maintain state within a request. Check the 'log' processor. I have one that checks if the user has permission on everything in the request before executing the commands.

          I do not think we need a factory where we need to maintain local state. Everything can be maintained on the method stack.

          For example:

          class LocalState {
            private final SolrQueryRequest req;
            private final SolrQueryResponse rsp;
            private final UpdateRequestProcessor next;
            private final NamedList<Object> toLog;

            LocalState(SolrQueryRequest req, SolrQueryResponse rsp,
                       UpdateRequestProcessor next, NamedList<Object> toLog) {
              this.req = req;
              this.rsp = rsp;
              this.next = next;
              this.toLog = toLog;
            }

            void doSomething() {
              // do your thing
            }
          }

          public class LogUpdateProcessor extends UpdateRequestProcessor {

            public void processAdd(AddUpdateCommand cmd) throws IOException {
              // pass the params (req, rsp, next, toLog assumed available to the processor)
              LocalState state = new LocalState(req, rsp, next, toLog);
              state.doSomething();
            }
          }
          
          

          Here the method acts as the factory.

          Shalin Shekhar Mangar added a comment -

          If each processor were shared across all requests, any state access would need to be synchronized and have some Map<Request,State> that seems to get ugly pretty fast.

          But we do have SolrQueryRequest#getContext to handle those cases, don't we? IMHO, we should not force users to write a factory class for each processor when the benefit is minimal and easy workarounds exist. Please correct me if I'm misunderstanding something.

          Nonsense - the more review / feedback / critique we get, the better - especially before a release

          Glad to hear that, though I realize that I'm a year late and that we are very close to a release.

          It's just that I set out to use this API and had to jump around for quite a while to figure out how to use it and how it works. I was quite surprised to find the actual chaining happening in a class which is named NoOpUpdateProcessor – though it made sense to me later. Also, it took me a while to find the wiki page for this feature because it is not linked off the main page (or the update xml/csv pages). I could find it because I knew that a class named UpdateRequestProcessor existed. We should link it off the main page so that it can be found more easily.

          Ryan McKinley added a comment -

          I'm all for simplifying the API. If you guys want to take a crack at it, I'll review it ASAP.

          Yonik Seeley added a comment -

          But we do have SolrQueryRequest#getContext to handle those cases, don't we? IMHO, we should not force users to write a factory class for each processor when the benefit is minimal and easy workarounds exist.

          Right... the alternative to a per-request instance would be to use the request context.
          In general, I think that would be more complex for a user though (if it's something they want to do per request-batch).
          I think that can be made more efficient for bulk loading by using factories too... context lookups and decisions don't have to be made for every document.

          Noble Paul added a comment -

          A very simple implementation. No factory. No state
          The attached patch has not removed the existing stuff and it is not a working model. But it demonstrates how you can put in a simpler API. The design is inspired by the ServletFilter API, but without the filterChain.doFilter() part (it relies on the return code).

          The configuration format is

          <updateRequestProcessorChain>
            <processor class="com.MyUpdateProcessor" />
            <processor class="solr.RunUpdateProcessor" />
          </updateRequestProcessorChain>
          
          <!-- Another one -->
          <updateRequestProcessorChain name="alternate">
            <processor class="org.apache.solr.ConditionalCopyProcessor" />
            <processor class="solr.RunUpdateProcessor" />
            <processor class="solr.LogUpdateProcessor" />
          </updateRequestProcessorChain>
          

          The usage must be as follows

          solrCore.getUpdateProcessorChain(name).processXXX(command,  solrQueryRequest, solrQueryResponse);
          

          This patch relies on SOLR-614 for simplifying configuration.

          Yonik Seeley added a comment -

          While I like the syntax of the config (getting rid of explicit chained update processor), I'm not sure about the internal changes:

          • I think that removing the factories does not simplify things... most processors that do interesting things will need to parse some request arguments and keep some state. So they will end up with a separate object that is looked up in the Context (and created if it's not there and stuffed into the Context). Same number of classes, but maybe even a little more complex.
          • We lose power by removing the explicit calling of "next" by components. I actually have a component that needs to buffer up some documents and pass them down the chain in batches later. I think Ryan might have something like this too.
          Ryan McKinley added a comment -

          I also like the simplified syntax, and I think the parent should always be a 'chain' – this can get rid of some of the ugliness.

          But the power of the chain model is that each link can take over control without the others needing to know. For example, I have a processor that validates everything in the request before passing it on to the next processors. To do this, it reads them all in without passing them down the chain and only continues when finish() is called.

          I also don't see a problem with the factory model. Creating a factory is no more/less difficult than creating a special 'state' object that gets put into the context. But with the context option, the state is always a Map call away rather than being right there. Now you have to worry about what key you used, etc...
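Ryan's validate-then-replay use case only works because each processor holds an explicit reference to the next link. A rough sketch under assumed, simplified types (hypothetical names, not Solr's real classes): the buffering processor withholds documents from the chain, then replays them all once finish() has validated the whole request.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the chain types under discussion.
abstract class UpdateProcessor {
    protected final UpdateProcessor next;

    UpdateProcessor(UpdateProcessor next) {
        this.next = next;
    }

    void processAdd(String doc) {
        if (next != null) next.processAdd(doc);
    }

    void finish() {
        if (next != null) next.finish();
    }
}

// Terminal link standing in for the processor that hits the index.
class CollectingProcessor extends UpdateProcessor {
    final List<String> received = new ArrayList<>();

    CollectingProcessor() {
        super(null);
    }

    @Override
    void processAdd(String doc) {
        received.add(doc);
    }
}

// Buffers every document; nothing reaches the rest of the chain until
// finish() has validated the complete request.
class ValidatingBufferProcessor extends UpdateProcessor {
    private final List<String> buffer = new ArrayList<>();

    ValidatingBufferProcessor(UpdateProcessor next) {
        super(next);
    }

    @Override
    void processAdd(String doc) {
        buffer.add(doc); // hold it; do NOT call next yet
    }

    @Override
    void finish() {
        for (String doc : buffer) {
            if (doc.isEmpty()) throw new IllegalArgumentException("invalid doc");
        }
        for (String doc : buffer) {
            next.processAdd(doc); // replay the validated batch down the chain
        }
        super.finish();
    }
}
```

A boolean return code cannot express this: the processor needs to decline to forward now and still be able to drive the downstream links later.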

          Shalin Shekhar Mangar added a comment -

          I think that removing the factories does not simplify things... most processors that do interesting things will need to parse some request arguments and keep some state. So they will end up with a separate object that is looked up in the Context (and created if it's not there and stuffed into the Context). Same number of classes, but maybe even a little more complex.

          How about giving this independence through configuration?

          <updateRequestProcessorChain scope="request">
            <processor class="com.MyUpdateProcessor" />
            <processor class="solr.RunUpdateProcessor" />
          </updateRequestProcessorChain>
          
          <!-- Another one -->
          <updateRequestProcessorChain name="alternate">
            <processor class="org.apache.solr.ConditionalCopyProcessor" scope="request" />
            <processor class="solr.RunUpdateProcessor" />
            <processor class="solr.LogUpdateProcessor" />
          </updateRequestProcessorChain>
          

          A "request" scope will create the chain or individual processor for each request so that you may maintain state without using request's context. Otherwise, it will be created once and re-used for all requests. Will that solve this problem?

          We lose power by removing the explicit calling of "next" by components. I actually have a component that needs to buffer up some documents and pass them down the chain in batches later. I think Ryan might have something like this too.

          In Noble's patch, instead of calling super.processXXX method, you can return true/false to signal or avoid chaining.
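The return-code style being proposed would drive the chain from the outside, roughly like this (a hypothetical sketch, not the patch's actual code): the chain runner consults each processor's boolean result and stops when one returns false.

```java
import java.util.List;

interface ChainableProcessor {
    // Returns true to let the chain continue, false to stop it.
    boolean processAdd(String doc);
}

class ChainRunner {
    private final List<ChainableProcessor> processors;

    ChainRunner(List<ChainableProcessor> processors) {
        this.processors = processors;
    }

    // The runner, not the processors, decides whether to keep going;
    // a processor can halt the chain but cannot re-enter or replay it later.
    boolean processAdd(String doc) {
        for (ChainableProcessor p : processors) {
            if (!p.processAdd(doc)) return false;
        }
        return true;
    }
}
```

This keeps each processor trivially simple, at the cost of the buffering/batching patterns discussed above, which need an explicit handle on the downstream links.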

          Ryan McKinley added a comment (edited) -

          A "request" scope will create the chain or individual processor for each request so that you may maintain state without using request's context. Otherwise, it will be created once and re-used for all requests. Will that solve this problem?

          -To me, that makes it more confusing than having each processor call next() explicitly...- (dooh - answering the wrong question) This gets overly complex too... do we add a special init() function? Would everything need a factory, but it may or may not be used?

          If the motivation for making the objects shared across requests is clarity, I'm not sure it helps. Is there some other reason?

          In Noble's patch, instead of calling super.processXXX method, you can return true/false to signal or avoid chaining.

          but then how would a processor be able to continue the chain? Consider the buffering example... how would I be able to call all buffered functions on finish()? What if I want a processor to make sure only one document is sent at a time?

          Shalin Shekhar Mangar added a comment -

          This gets overly complex too... do we add a special init() function? would everything need a factory, but it may or may not be used?

          No, why would we need special methods or a factory? Just the init/inform will be fine. Just that they would be called once in their scope. Am I missing something?

          I don't really care about sharing objects across requests. My motivation is only to help make the API simpler.

          Consider the buffering example... how would I be able to call all buffered functions on finish()? What if I want a processor to make sure only one document is sent at a time?

          I see your point here. A next-UpdateProcessor or Servlet FilterChain-like design will be necessary in that case.

          Let me think more on this since I've obviously under-estimated the use-cases for this API. I always thought that one should do heavy-duty processing like authentication etc. on the client side before sending documents to Solr or else one should extend/write an UpdateHandler.

          Noble Paul added a comment -

          The idea is to make the API simple. If a Processor wishes to create a state object, it is easier to do it without a factory than with a factory. The user has to care about very few interfaces. I can draw parallels with Servlet Filters. Users write very complex filters and I have never seen people complaining about it not having a factory. SolrDispatchFilter is a very good example. If it is simple enough, people will use it. If it is complex, only the 'very smart people' use it. Most of the users are not power users and they just want to get things done.



            People

            • Assignee: Ryan McKinley
            • Reporter: Ryan McKinley
            • Votes: 0
            • Watchers: 0
