Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-7188

Run Data Import Handler processes in a SolrJ client

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Adds a DataImportHandlerClient class that wraps an EmbeddedSolrServer and adds a DIHCloudWriter implementation of DIHWriter that sends documents to a remote SolrCloud cluster. This enables existing DIH processes to run outside of the Solr JVM which should enable better scalability.

      The current architecture of DIH imposes several restrictions on scalability. First, the DIH runs in the same process space as Solr itself and competes for resources (CPU and memory) with normal Solr processes devoted to indexing and querying. Second, the DIH cannot be multi-threaded which means that parallelizing it requires splitting the processing amongst nodes in a SolrCloud cluster. Since the incoming data is sent through an UpdateRequestProcessor chain (via the SolrWriter implementation of DIHWriter), additional routing is done internally as the documents are forwarded to the current shard leader nodes once the ID hash is computed. This causes additional network traffic within the SolrCloud cluster. Scaling the DIH is limited by the number of nodes in the cluster and any heavy-duty processing due to entity processors or transformation elements shares the processing resources of Solr itself. This is known to be a source of bottlenecks in Solr installations (SolrCloud or Master-Slave) that use DIH.

      The DataImportHandlerClient uses native DIH functionality - DataImporter, etc. but can be run externally to Solr. This means that as many processes as are needed to achieve necessary performance at scale can be added and the processing that occurs within the DataImportHandler is done outside of the Solr JVM. The same benefits that accrue with multiple SolrJ clients can now be realized with DIH without the necessity of porting code from DIH to a SolrJ client.

      1. ESS_as_a_copy.patch
        10 kB
        Noble Paul
      2. IDEA-AS-CODE.patch
        56 kB
        Noble Paul
      3. SOLR-7188.patch
        144 kB
        Noble Paul
      4. SOLR-7188.patch
        144 kB
        Noble Paul
      5. SOLR-7188.patch
        142 kB
        Ted Sullivan
      6. SOLR-7188.patch
        62 kB
        Ted Sullivan
      7. SOLR-7188.patch
        63 kB
        Ted Sullivan

        Issue Links

          Activity

          Hide
          elyograg Shawn Heisey added a comment -

          Wouldn't it be a better option to refactor DIH as a component that uses only SolrJ client APIs instead of internal Solr APIs? Making a cloud-aware version would be probably be really easy if you write the code to use SolrClient, because you can simply swap HttpSolrClient for CloudSolrClient in the DIH code. For standalone user applications, the required resources would be lower if you don't use a full EmbeddedSolrServer.

          Show
          elyograg Shawn Heisey added a comment - Wouldn't it be a better option to refactor DIH as a component that uses only SolrJ client APIs instead of internal Solr APIs? Making a cloud-aware version would be probably be really easy if you write the code to use SolrClient, because you can simply swap HttpSolrClient for CloudSolrClient in the DIH code. For standalone user applications, the required resources would be lower if you don't use a full EmbeddedSolrServer.
          Hide
          tedsullivan Ted Sullivan added a comment -

          Shawn - yes I agree. The component should support both HttpSolrClient and CloudSolrClient so that is a good suggestion. As to standalone - I wanted to ensure that existing DIH code would "just work" in a client - EmbeddedSolrServer provides the necessary "substrate" to plug in the DataImporter which requires a SolrCore to function Any ideas on how to implement that would be much appreciated.

          Show
          tedsullivan Ted Sullivan added a comment - Shawn - yes I agree. The component should support both HttpSolrClient and CloudSolrClient so that is a good suggestion. As to standalone - I wanted to ensure that existing DIH code would "just work" in a client - EmbeddedSolrServer provides the necessary "substrate" to plug in the DataImporter which requires a SolrCore to function Any ideas on how to implement that would be much appreciated.
          Hide
          elyograg Shawn Heisey added a comment -

          I have tried to understand DIH code in the past. It has proven to be very complex, and I get lost easily. I probably need a knowledgeable guide to have any hope.

          Show
          elyograg Shawn Heisey added a comment - I have tried to understand DIH code in the past. It has proven to be very complex, and I get lost easily. I probably need a knowledgeable guide to have any hope.
          Hide
          tedsullivan Ted Sullivan added a comment - - edited

          Shawn Heisey - thanks - I have refactored the code to take any SolrClient as a proxy so that it does not create its own client internally.

          I'm not sure what you mean by "Wouldn't it be a better option to refactor DIH as a component that uses only SolrJ client APIs instead of internal Solr APIs?"

          I have not refactored DIH - that seems to be more involved than the wrapper that I wrote. Not exactly sure what you are proposing here. What are your concerns about using EmbeddedSolrServer? It really doesn't do any processing except to provide the machinery to load the data config and so forth, so I am not sure what your concerns are.

          Show
          tedsullivan Ted Sullivan added a comment - - edited Shawn Heisey - thanks - I have refactored the code to take any SolrClient as a proxy so that it does not create its own client internally. I'm not sure what you mean by "Wouldn't it be a better option to refactor DIH as a component that uses only SolrJ client APIs instead of internal Solr APIs?" I have not refactored DIH - that seems to be more involved than the wrapper that I wrote. Not exactly sure what you are proposing here. What are your concerns about using EmbeddedSolrServer? It really doesn't do any processing except to provide the machinery to load the data config and so forth, so I am not sure what your concerns are.
          Hide
          elyograg Shawn Heisey added a comment -

          Wouldn't it be a better option to refactor DIH as a component that uses only SolrJ client APIs instead of internal Solr APIs?

          I mean that DIH should be changed so that its only Solr/Lucene dependency is solr-solrj, specificially the SolrClient class (and any transitive dependencies). The dependency on solr-core should disappear.

          What are your concerns about using EmbeddedSolrServer?

          The embedded object has dependencies outside of the solrj directory that are not obvious and as far as I can tell are not well documented. I don't know if any of the Lucene jars are required for the usage you have implemented, but I suspect they probably are. If so, none of those jars are included as distinct files in the Solr download.

          Rebuilding DIH so compiling only requires SolrClient would allow any implementation. With some care (and additional jars added by the user), that probably would include EmbeddedSolrServer.

          Show
          elyograg Shawn Heisey added a comment - Wouldn't it be a better option to refactor DIH as a component that uses only SolrJ client APIs instead of internal Solr APIs? I mean that DIH should be changed so that its only Solr/Lucene dependency is solr-solrj, specificially the SolrClient class (and any transitive dependencies). The dependency on solr-core should disappear. What are your concerns about using EmbeddedSolrServer? The embedded object has dependencies outside of the solrj directory that are not obvious and as far as I can tell are not well documented. I don't know if any of the Lucene jars are required for the usage you have implemented, but I suspect they probably are. If so, none of those jars are included as distinct files in the Solr download. Rebuilding DIH so compiling only requires SolrClient would allow any implementation. With some care (and additional jars added by the user), that probably would include EmbeddedSolrServer.
          Hide
          tedsullivan Ted Sullivan added a comment -

          thanks Shawn Heisey - I'll know more about the dependencies when I do some system testing and will document these. As to rebuilding DIH - hopefully someone with more experience in this such as Noble Paul can comment on the relative merits of that approach.

          Show
          tedsullivan Ted Sullivan added a comment - thanks Shawn Heisey - I'll know more about the dependencies when I do some system testing and will document these. As to rebuilding DIH - hopefully someone with more experience in this such as Noble Paul can comment on the relative merits of that approach.
          Hide
          noble.paul Noble Paul added a comment -

          I mean that DIH should be changed so that its only Solr/Lucene dependency is solr-solrj, specificially the SolrClient class (and any transitive dependencies). The dependency on solr-core should disappear.

          It would be nice if we could do that . But DIH itself is an instance of a SolrRequestHandler so a compile-time dependency is inevitable. However we can avoid runtime dependencies. We can abstract out certain parts and make this possible . But, I guess that should be a separate ticket . It is better to keep this ticket focussed on just one thing

          Show
          noble.paul Noble Paul added a comment - I mean that DIH should be changed so that its only Solr/Lucene dependency is solr-solrj, specificially the SolrClient class (and any transitive dependencies). The dependency on solr-core should disappear. It would be nice if we could do that . But DIH itself is an instance of a SolrRequestHandler so a compile-time dependency is inevitable. However we can avoid runtime dependencies. We can abstract out certain parts and make this possible . But, I guess that should be a separate ticket . It is better to keep this ticket focussed on just one thing
          Hide
          elyograg Shawn Heisey added a comment -

          But DIH itself is an instance of a SolrRequestHandler so a compile-time dependency is inevitable.

          Ah, that makes sense. So to achieve what I'm imagining, it would need to be split into two parts - the handler and the actual DIH component used by the handler.

          Everything about my idea is theoretical at this point, this is just what I would like to see. I did not try to evaluate Ted's patch, and I do not want to discourage anyone. Almost any additional capability is worth pursuing, unless we know we can avoid a complete rewrite for future plans by pursuing a different idea.

          Show
          elyograg Shawn Heisey added a comment - But DIH itself is an instance of a SolrRequestHandler so a compile-time dependency is inevitable. Ah, that makes sense. So to achieve what I'm imagining, it would need to be split into two parts - the handler and the actual DIH component used by the handler. Everything about my idea is theoretical at this point, this is just what I would like to see. I did not try to evaluate Ted's patch, and I do not want to discourage anyone. Almost any additional capability is worth pursuing, unless we know we can avoid a complete rewrite for future plans by pursuing a different idea.
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          I'm fairly new to Solr, but from experience on TIKA-1302, anything you can do to encapsulate Tika's jvm from everything else is a great idea. No matter how hard we try over on Tika, parsers will misbehave, go OOM and/or have permanent hangs. These happen fairly rarely, but when they do, they're a showstopper for anything in Tika's jvm.

          We recently added an EvilParser to our tika-parser test suite, and if you are interested in hardening DIH against these issues, that parser might be useful for testing.

          Another thought would be to use tika-server (JAX-RS) and call out to that from DIH. In the next few months, we plan to harden that so that it will shutdown on oom/permahang and a parent process will restart it.

          Show
          tallison@mitre.org Tim Allison added a comment - - edited I'm fairly new to Solr, but from experience on TIKA-1302 , anything you can do to encapsulate Tika's jvm from everything else is a great idea. No matter how hard we try over on Tika, parsers will misbehave, go OOM and/or have permanent hangs. These happen fairly rarely, but when they do, they're a showstopper for anything in Tika's jvm. We recently added an EvilParser to our tika-parser test suite, and if you are interested in hardening DIH against these issues, that parser might be useful for testing. Another thought would be to use tika-server (JAX-RS) and call out to that from DIH. In the next few months, we plan to harden that so that it will shutdown on oom/permahang and a parent process will restart it.
          Hide
          tedsullivan Ted Sullivan added a comment -

          Tim Allison thanks - there are similar issues in Solr - OOM, Stop-the-world GC events, etc. if the Solr jvm is encumbered by custom processing.

          Shawn Heisey - separating out a DIH component that could be used with both Solr and SolrJ makes the most sense to me. A cursory examination of the code suggests that a lot of what SolrCore provides is support for things like resource loading, schema, Zookeeper awareness and such. I haven't done a full analysis yet but the bulk of the dependencies appear to be in DataImporter and DocBuilder. That solution would need to be backward compatible with existing Solr DIH architectures. That is a larger problem than the one that I set out to solve (when has this happened before, right?)

          There is a trivial reason that my DataImportHandlerClient class is in the dataimport package - DataImporter includes some package-private methods. I did not want to make changes to existing code for this solution.

          Show
          tedsullivan Ted Sullivan added a comment - Tim Allison thanks - there are similar issues in Solr - OOM, Stop-the-world GC events, etc. if the Solr jvm is encumbered by custom processing. Shawn Heisey - separating out a DIH component that could be used with both Solr and SolrJ makes the most sense to me. A cursory examination of the code suggests that a lot of what SolrCore provides is support for things like resource loading, schema, Zookeeper awareness and such. I haven't done a full analysis yet but the bulk of the dependencies appear to be in DataImporter and DocBuilder. That solution would need to be backward compatible with existing Solr DIH architectures. That is a larger problem than the one that I set out to solve (when has this happened before, right?) There is a trivial reason that my DataImportHandlerClient class is in the dataimport package - DataImporter includes some package-private methods. I did not want to make changes to existing code for this solution.
          Hide
          noble.paul Noble Paul added a comment -

          Ted Sullivan I don't think backcompat will be an issue.

          Basically, what we need is a class called DIHSolrAbstractionLayer and all classes in DIH should switch to use that class for all Solrrelated services , which would have multiple implementations depending on where it is being run

          Show
          noble.paul Noble Paul added a comment - Ted Sullivan I don't think backcompat will be an issue. Basically, what we need is a class called DIHSolrAbstractionLayer and all classes in DIH should switch to use that class for all Solrrelated services , which would have multiple implementations depending on where it is being run
          Hide
          noble.paul Noble Paul added a comment - - edited

          An idea of isolating the Solr core dependencies out into a separate class. untested

          There is an interface called AbstarctionLayer and currently I just moved all code to SolrAbstractionLayer. If we can provide another implementation called SolrjAbstractionLayer we will be fine

          Show
          noble.paul Noble Paul added a comment - - edited An idea of isolating the Solr core dependencies out into a separate class. untested There is an interface called AbstarctionLayer and currently I just moved all code to SolrAbstractionLayer. If we can provide another implementation called SolrjAbstractionLayer we will be fine
          Hide
          tedsullivan Ted Sullivan added a comment -

          I have refactored the DIH code using Noble Paul AbstractionLayer interface idea. All tests pass now. Currently, the DIH client code is in a sub directory in org/apache/solr/handler/dataimport/client - It may make sense to move this code to SolrJ for jar packaging etc. That would probably require a new JIRA ticket - for the refactoring / abstraction layer piece. Not sure what the best strategy is at this point.

          TO DOs: There is one bit of code needed in the ClientAbstractionLayer to reproduce the IndexSchema without requiring SolrConfig. Also need a test case for this since the original test case does not need to use this code.

          Show
          tedsullivan Ted Sullivan added a comment - I have refactored the DIH code using Noble Paul AbstractionLayer interface idea. All tests pass now. Currently, the DIH client code is in a sub directory in org/apache/solr/handler/dataimport/client - It may make sense to move this code to SolrJ for jar packaging etc. That would probably require a new JIRA ticket - for the refactoring / abstraction layer piece. Not sure what the best strategy is at this point. TO DOs: There is one bit of code needed in the ClientAbstractionLayer to reproduce the IndexSchema without requiring SolrConfig. Also need a test case for this since the original test case does not need to use this code.
          Hide
          noble.paul Noble Paul added a comment -

          With the schema API in place we can really get the schema information through a REST API.

          Show
          noble.paul Noble Paul added a comment - With the schema API in place we can really get the schema information through a REST API.
          Hide
          noble.paul Noble Paul added a comment -

          A lot of tests are failing with the newest patch

          Show
          noble.paul Noble Paul added a comment - A lot of tests are failing with the newest patch
          Hide
          noble.paul Noble Paul added a comment -

          fixing compile issues

          Show
          noble.paul Noble Paul added a comment - fixing compile issues
          Hide
          noble.paul Noble Paul added a comment -

          We should change the client AbstractionLayer to fetch schema from the schema API http://host:8983/solr/coll/schema will give the whole schema out and we can parse it out to know the fields

          I would recommend to split this into two. Let is check in the AbstractionLayer and the SolrAbstractionLayer in this ticket and let's open a sub issue for the ClientAbstractionLayer . We need to document on how we plan to run this from command-line

          Show
          noble.paul Noble Paul added a comment - We should change the client AbstractionLayer to fetch schema from the schema API http://host:8983/solr/coll/schema will give the whole schema out and we can parse it out to know the fields I would recommend to split this into two. Let is check in the AbstractionLayer and the SolrAbstractionLayer in this ticket and let's open a sub issue for the ClientAbstractionLayer . We need to document on how we plan to run this from command-line
          Hide
          noble.paul Noble Paul added a comment -

          I was thinking of another solution where we can point the EmbeddedSolrServer to ZK and it can start up without being a member of the collection. In that case we will have the whole DIH running in the EmbeddedSolrServer and just replace the SolrWriter with a SolrJWriter (which writes to Solr using SolrJ) and we are good to go

          Show
          noble.paul Noble Paul added a comment - I was thinking of another solution where we can point the EmbeddedSolrServer to ZK and it can start up without being a member of the collection. In that case we will have the whole DIH running in the EmbeddedSolrServer and just replace the SolrWriter with a SolrJWriter (which writes to Solr using SolrJ) and we are good to go
          Hide
          mkhludnev Mikhail Khludnev added a comment -

          though, some people don't run ZK at all, it should fallback to REST schema.

          Show
          mkhludnev Mikhail Khludnev added a comment - though, some people don't run ZK at all, it should fallback to REST schema.
          Show
          mkhludnev Mikhail Khludnev added a comment - sadly, there is no such implementation so far https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/morphlines-core/src/java/org/apache/solr/morphlines/solr/SolrLocator.java#L127
          Hide
          noble.paul Noble Paul added a comment - - edited

          We can use the ZookeeperInfoServlet to retrieve everything from the REST interface . We just need to provide a ResourceLoader which can load fies from the HTTP end point

          Show
          noble.paul Noble Paul added a comment - - edited We can use the ZookeeperInfoServlet to retrieve everything from the REST interface . We just need to provide a ResourceLoader which can load fies from the HTTP end point
          Hide
          tedsullivan Ted Sullivan added a comment -

          Noble Paul That is similar to my original submission idea except that it was a DIHClientBase subclass that I called DIHClientWriter. It used a local configuration directory, not ZK. So I would think that we could do both - use ZK or local configuration. DIHClientWriter calls CloudSolrServer so that also should be configurable I would think.

          Show
          tedsullivan Ted Sullivan added a comment - Noble Paul That is similar to my original submission idea except that it was a DIHClientBase subclass that I called DIHClientWriter. It used a local configuration directory, not ZK. So I would think that we could do both - use ZK or local configuration. DIHClientWriter calls CloudSolrServer so that also should be configurable I would think.
          Hide
          noble.paul Noble Paul added a comment -

          We need to load the config from ZK or the rest interface

          Show
          noble.paul Noble Paul added a comment - We need to load the config from ZK or the rest interface
          Hide
          noble.paul Noble Paul added a comment - - edited

          running EmbeddedSolrServer as a copy of a live server (Untested) Varun Thacker This could be a totally new ticket

          Show
          noble.paul Noble Paul added a comment - - edited running EmbeddedSolrServer as a copy of a live server (Untested) Varun Thacker This could be a totally new ticket

            People

            • Assignee:
              Unassigned
              Reporter:
              tedsullivan Ted Sullivan
            • Votes:
              11 Vote for this issue
              Watchers:
              16 Start watching this issue

              Dates

              • Created:
                Updated:

                Development