Solr
  1. Solr
  2. SOLR-2842

Re-factor UpdateChain and UpdateProcessor interfaces

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: update
    • Labels:
      None

      Description

      The UpdateChain's main task is to send SolrInputDocuments through a chain of UpdateRequestProcessors in order to transform them in some way and then (typically) indexing them.

      This generic "pipeline" concept would also be useful on the client side (SolrJ), so that we could choose to do parts or all of the processing on the client. The most prominent use case is extracting text (Tika) from large binary documents, residing on local storage on the client(s). Streaming hundreds of Mb over to Solr for processing is not efficcient. See SOLR-1526.

      We're already implementing Tika as an UpdateProcessor in SOLR-1763, and what would be more natural than reusing this - and any other processor - on the client side?

      However, for this to be possible, some interfaces need to change slightly..

        Issue Links

          Activity

          Hide
          Yonik Seeley added a comment -

          With all the libraries, configuration, and everything else one would need in this client, it starts looking very much like a Solr server again! I can even imagine once one has this fat client, that one would want to be able to accept requests from others to get the same processing. It almost seems preferable to "just use solr" instances as these special tika processors.

          It might make sense when setting up a cluster to have a bank of solr servers dedicated just to rich document processing, and then they will forward the processed document to the correct shard (assuming new solr cloud stuff).

          Custom code could somehow live in that bank of indexers to avoid an extra copy of large binary documents, or outside clients could use stream.url to make solr directly stream the large file from the source.

          Show
          Yonik Seeley added a comment - With all the libraries, configuration, and everything else one would need in this client, it starts looking very much like a Solr server again! I can even imagine once one has this fat client, that one would want to be able to accept requests from others to get the same processing. It almost seems preferable to "just use solr" instances as these special tika processors. It might make sense when setting up a cluster to have a bank of solr servers dedicated just to rich document processing, and then they will forward the processed document to the correct shard (assuming new solr cloud stuff). Custom code could somehow live in that bank of indexers to avoid an extra copy of large binary documents, or outside clients could use stream.url to make solr directly stream the large file from the source.
          Hide
          Jan Høydahl added a comment -

          Yep, for distrib cloud stuff it would be cool to be able to have dedicated doc processor nodes.

          I don't think the client necessarily needs to be THAT fat or complex if this is done right. If we make the UpdateChain and the Processor itself more stand-alone, not depending on SolrCore, and make updateChains easily configurable outside of solrconfig.xml (see SOLR-2841), then it would be straight-forward to instansiate a chain on the client side, without the RunUpdateProcessor of course. Some processors use Schema, so we'd perhaps need a way to fetch the correct schema from the server, using admin/file or even better, ZK.

          Show
          Jan Høydahl added a comment - Yep, for distrib cloud stuff it would be cool to be able to have dedicated doc processor nodes. I don't think the client necessarily needs to be THAT fat or complex if this is done right. If we make the UpdateChain and the Processor itself more stand-alone, not depending on SolrCore, and make updateChains easily configurable outside of solrconfig.xml (see SOLR-2841 ), then it would be straight-forward to instansiate a chain on the client side, without the RunUpdateProcessor of course. Some processors use Schema, so we'd perhaps need a way to fetch the correct schema from the server, using admin/file or even better, ZK.
          Hide
          Mark Miller added a comment -

          If we make the UpdateChain and the Processor itself more stand-alone, not depending on SolrCore,

          But the update processor should have access to SolrCore. I don't think this is something we want to drop. You do want access to the Request object and the SolrCore, as you have now.

          Show
          Mark Miller added a comment - If we make the UpdateChain and the Processor itself more stand-alone, not depending on SolrCore, But the update processor should have access to SolrCore. I don't think this is something we want to drop. You do want access to the Request object and the SolrCore, as you have now.
          Hide
          Jan Høydahl added a comment -

          But the update processor should have access to SolrCore. I don't think this is something we want to drop. You do want access to the Request object and the SolrCore, as you have now.

          The UpdateRequestProcessorChain now depends on SolrCore for getting config from solrconfig.xml, so we'd first need to separate updateChain config from solrconfig, e.g. through SOLR-2841 or similar. Although flexible to have "full access" for the Processors, it doesn't necessarily give the best APIs. Most processors will only need access to the input document and request params. In addition I think schema access for validating input and a resource loader to load own config from file are good candidates for what to provide to Processors. The resource loader on the client side could resolve resources locally, or even through the ZK loader.

          The remaining 5% of processors which really need SolrCore (such as RunUpdateProcessor) should implement SolrCoreAware, and UpdateChain should statically check and throw an exception if any of these are attempted loaded in a context where SolrCore is null.

          Show
          Jan Høydahl added a comment - But the update processor should have access to SolrCore. I don't think this is something we want to drop. You do want access to the Request object and the SolrCore, as you have now. The UpdateRequestProcessorChain now depends on SolrCore for getting config from solrconfig.xml, so we'd first need to separate updateChain config from solrconfig, e.g. through SOLR-2841 or similar. Although flexible to have "full access" for the Processors, it doesn't necessarily give the best APIs. Most processors will only need access to the input document and request params. In addition I think schema access for validating input and a resource loader to load own config from file are good candidates for what to provide to Processors. The resource loader on the client side could resolve resources locally, or even through the ZK loader. The remaining 5% of processors which really need SolrCore (such as RunUpdateProcessor) should implement SolrCoreAware, and UpdateChain should statically check and throw an exception if any of these are attempted loaded in a context where SolrCore is null.
          Hide
          Chris Male added a comment - - edited

          I really agree with Yonik here. I'm very wary of making the client API any more complex. There are many processing pipeline implementations out there, why should ours go client-side? It'd only benefit those using SolrJ and come at the cost of increased complexity. Having to check that the UpdateProcessor is running on on a client or server and then throwing Exceptions in certain circumstances... it all just feels a little messy!

          The Tika processing situation seems a different problem which Yonik's suggestion seems very reasonable - have a local Solr instance that replicates.

          I also agree with Mark. We shouldn't strip access to SolrCore, but I think we can reach a middle ground where the UpdateProcessor can define whether it wants a SolrCore reference? Bean setters anybody? The same goes for any Schemas / ResourceLoaders. We should make them all optional but definitely accessible.

          I don't want to seem like a downer, because I am fully for any refactorings and cleanups of these interfaces where possible.

          Show
          Chris Male added a comment - - edited I really agree with Yonik here. I'm very wary of making the client API any more complex. There are many processing pipeline implementations out there, why should ours go client-side? It'd only benefit those using SolrJ and come at the cost of increased complexity. Having to check that the UpdateProcessor is running on on a client or server and then throwing Exceptions in certain circumstances... it all just feels a little messy! The Tika processing situation seems a different problem which Yonik's suggestion seems very reasonable - have a local Solr instance that replicates. I also agree with Mark. We shouldn't strip access to SolrCore, but I think we can reach a middle ground where the UpdateProcessor can define whether it wants a SolrCore reference? Bean setters anybody? The same goes for any Schemas / ResourceLoaders. We should make them all optional but definitely accessible. I don't want to seem like a downer, because I am fully for any refactorings and cleanups of these interfaces where possible.
          Hide
          Jan Høydahl added a comment -

          Some valid points there. I thought I saw a possibility for generalization that would help solve SOLR-1526, but wanted to flesh out the feasibility here.

          So far I do not see any other example than Tika extraction which could really benefit from being done client-side. There may be others, but perhaps not justifying this change.

          Another option for SOLR-1526 could be to provide a ClientExtractingUpdateProcessorFactory class which instantiates the ExtractingUpdateProcessor for client side use. Then if other processors are useful on the client side as well, people simply write a Client factory for them?

          Show
          Jan Høydahl added a comment - Some valid points there. I thought I saw a possibility for generalization that would help solve SOLR-1526 , but wanted to flesh out the feasibility here. So far I do not see any other example than Tika extraction which could really benefit from being done client-side. There may be others, but perhaps not justifying this change. Another option for SOLR-1526 could be to provide a ClientExtractingUpdateProcessorFactory class which instantiates the ExtractingUpdateProcessor for client side use. Then if other processors are useful on the client side as well, people simply write a Client factory for them?
          Hide
          Ryan McKinley added a comment -

          rather then use UpdateProcessor directly, what about adding a simple interface like:

           SolrInputDocument transform(SolrInputDocument)
          

          and using simple bean getter/setters – perhaps also respecting the 'aware' interfaces (SolrCoreAware, SchemaAware, ResourceLoaderAware)

          It seems like most of the custom things we would want to do only care about 'add' and don't care about commit,delete,merge,rollback. Starting with a simple interface like this would give us lots of flexibility to integrate wherever it feels most appropriate – client/server or any other pipeline framework (I've been using commons pipeline with pretty reasonable success)

          Show
          Ryan McKinley added a comment - rather then use UpdateProcessor directly, what about adding a simple interface like: SolrInputDocument transform(SolrInputDocument) and using simple bean getter/setters – perhaps also respecting the 'aware' interfaces (SolrCoreAware, SchemaAware, ResourceLoaderAware) It seems like most of the custom things we would want to do only care about 'add' and don't care about commit,delete,merge,rollback. Starting with a simple interface like this would give us lots of flexibility to integrate wherever it feels most appropriate – client/server or any other pipeline framework (I've been using commons pipeline with pretty reasonable success)
          Hide
          Jan Høydahl added a comment -

          Interesting thought Ryan. I already commonly isolate processing in a similar method to simplify unit testing so this is useful in several ways. Do you suggest to let UpdateProcessor base class implement this interface? But you still need to construct and initialize the processors even if they are wrapped in the interface, thus my suggestion for a client side version of the factory.

          Show
          Jan Høydahl added a comment - Interesting thought Ryan. I already commonly isolate processing in a similar method to simplify unit testing so this is useful in several ways. Do you suggest to let UpdateProcessor base class implement this interface? But you still need to construct and initialize the processors even if they are wrapped in the interface, thus my suggestion for a client side version of the factory.
          Hide
          Ryan McKinley added a comment -

          I don't have a real proposal... just thinking about generally reusable pipeline code.

          Do you suggest to let UpdateProcessor base class implement this interface?

          No, since most domain specific UpdateProcessors can be boiled down to this (tika, langid, geonames, etc) i don't think they need to have access to the whole UpdateProcessor – only sometimes do they need access to SolrCore/Schema/ResourceLoader etc. With minimal dependencies, moving them around would be easy.

          I was thinking we could have a general TransformingUpdateProcesor that could take a list of transformers (or something), rather then having all the dependencies

          But you still need to construct and initialize the processors even if they are wrapped in the interface, thus my suggestion for a client side version of the factory.

          I'm not convinced that a client side framework is necessary if the interfaces were easy enough to deal with directly. I can see where a DSL would be cool, but having a client side NamedListInitalizedPlugin seems like a can of worms

          .

          Show
          Ryan McKinley added a comment - I don't have a real proposal... just thinking about generally reusable pipeline code. Do you suggest to let UpdateProcessor base class implement this interface? No, since most domain specific UpdateProcessors can be boiled down to this (tika, langid, geonames, etc) i don't think they need to have access to the whole UpdateProcessor – only sometimes do they need access to SolrCore/Schema/ResourceLoader etc. With minimal dependencies, moving them around would be easy. I was thinking we could have a general TransformingUpdateProcesor that could take a list of transformers (or something), rather then having all the dependencies But you still need to construct and initialize the processors even if they are wrapped in the interface, thus my suggestion for a client side version of the factory. I'm not convinced that a client side framework is necessary if the interfaces were easy enough to deal with directly. I can see where a DSL would be cool, but having a client side NamedListInitalizedPlugin seems like a can of worms .
          Hide
          Chris Male added a comment -

          Ryan is on the right track here. Really many of these changes are just related to the 'Add' update logic. UpdateProcessor provides a lot of integration with Solr's CRUD features and I think we should leave it that way. If we want to provide a cleaner system for doing Document manipulations that would be part of adding a Document, then lets keep the focus on that.

          Having a simple interface like Ryan suggests seems the best way to go forward. Then all the actual meat of the Document manipulation logic can go there. Integration with UpdateProcessor seems pretty straightforward. It then frees us up to tackle configuration / DSLs / reuse / whatever else buzzword, without changing the already powerful and functional UpdateProcessor framework.

          Show
          Chris Male added a comment - Ryan is on the right track here. Really many of these changes are just related to the 'Add' update logic. UpdateProcessor provides a lot of integration with Solr's CRUD features and I think we should leave it that way. If we want to provide a cleaner system for doing Document manipulations that would be part of adding a Document, then lets keep the focus on that. Having a simple interface like Ryan suggests seems the best way to go forward. Then all the actual meat of the Document manipulation logic can go there. Integration with UpdateProcessor seems pretty straightforward. It then frees us up to tackle configuration / DSLs / reuse / whatever else buzzword, without changing the already powerful and functional UpdateProcessor framework.
          Hide
          Mark Miller added a comment - - edited

          cool - I like where this is going.

          Show
          Mark Miller added a comment - - edited cool - I like where this is going.
          Hide
          Hoss Man added a comment -

          In general i'm leery of trying to completely refactor the chain/processor APIs to work on clients as well as the server, because right now they deal with "commands" (id: add command, delete command) and i'm not sure how well that will map on the client side.

          the right layer to probably approach this from (for processors where it makes sense) is to encapsulate the core functionality of each processor in a class that can be reused on either client of server side, and then make the processor just responsible for dealing with the configuration and request params (things that on the client side would just be part of the java api)

          Ryan's suggestion about a simple transformer interface is pretty much dead in line with the goal i was aiming for in SOLR-2802, but i certainly hadn't considered the idea of reusing these on the client – just making it easier to write them in a few lines of code, so the inheritance relationship used in the SOLR-2802 patches would probably need to be refactored into a delegation relationship, and then maybe it could serve both purposes?

          Show
          Hoss Man added a comment - In general i'm leery of trying to completely refactor the chain/processor APIs to work on clients as well as the server, because right now they deal with "commands" (id: add command, delete command) and i'm not sure how well that will map on the client side. the right layer to probably approach this from (for processors where it makes sense) is to encapsulate the core functionality of each processor in a class that can be reused on either client of server side, and then make the processor just responsible for dealing with the configuration and request params (things that on the client side would just be part of the java api) Ryan's suggestion about a simple transformer interface is pretty much dead in line with the goal i was aiming for in SOLR-2802 , but i certainly hadn't considered the idea of reusing these on the client – just making it easier to write them in a few lines of code, so the inheritance relationship used in the SOLR-2802 patches would probably need to be refactored into a delegation relationship, and then maybe it could serve both purposes?

            People

            • Assignee:
              Unassigned
              Reporter:
              Jan Høydahl
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:

                Development