ManifoldCF / CONNECTORS-19

Look into converting SOLR connector to use SolrJ java library

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: ManifoldCF 0.1, ManifoldCF 0.2
    • Fix Version/s: ManifoldCF 1.1
    • Component/s: Lucene/SOLR connector
    • Labels: None

      Description

      The SOLR connector currently uses its own multipart post code. It might be a good idea to convert it to use the SolrJ client API jar instead. This would require license confirmation, plus research to make sure the new jar does not conflict with the jars of any other connector.

        Activity

        Jettro Coenradie added a comment (edited) -

        We have a working Solr connector that makes use of SolrJ. This might be a good start. I might need to spend some time to make it run in the LCF build. We have a Maven build to package it at the moment. If you are interested, let me know. Then I will spend the time on a patch.

        Karl Wright added a comment -

        I am certainly interested. One thing I want to be certain of though is what jar dependencies would be necessary for your implementation, and whether the connector you have built is indeed as full-featured as the one it would be replacing?

        Jettro Coenradie added a comment -

        I will have a good look at the dependencies and the functionality. If satisfied, I will supply a patch that others can check as well.

        Jan Høydahl added a comment -

        Any progress on this? I'd like to see a Solr outputConnector with MultiThread support (StreamingUpdateSolrServer)

        Karl Wright added a comment -

        The promised patch never materialized.

        One point, though, is that ManifoldCF is not single-threaded in any case, so you'd be unlikely to gain much in performance by going "multithread" on an already multi-threaded connector implementation. The current connector can maintain and use as many connections to Solr as you tell it. Memory buffering on the client side also is not a good idea because it violates the basic ManifoldCF principle that you can safely shut down and restart ManifoldCF at any time without loss.

        Solr also suffers from lack of a "guaranteed delivery" metaphor, which I've talked to the Solr team about in the past. The Solr commit model currently does not work this way but ManifoldCF really requires it, because without it there is no way to properly implement an incremental crawler. This would mean a significant new Solr feature.
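
The concern about client-side buffering can be illustrated with a small, self-contained sketch. It does not use SolrJ or ManifoldCF; every class and method name below (`FakeIndex`, `BufferingSender`, `SynchronousSender`, `lostAfterAbruptStop`) is hypothetical and exists only to show why acknowledging a document before it has been delivered is unsafe when the process may stop at any time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from ManifoldCF or SolrJ.
class DeliverySketch {

    /** Stand-in for the remote index; records the documents that actually arrived. */
    static class FakeIndex {
        final List<String> received = new ArrayList<>();
        void post(String docId) { received.add(docId); }
    }

    /** Holds documents in memory and reports success before sending them. */
    static class BufferingSender {
        private final FakeIndex index;
        private final List<String> buffer = new ArrayList<>();
        BufferingSender(FakeIndex index) { this.index = index; }

        boolean send(String docId) {
            buffer.add(docId);   // only queued, not yet delivered
            return true;         // the caller now believes the document is indexed
        }

        void flush() {
            for (String id : buffer) index.post(id);
            buffer.clear();
        }
    }

    /** Reports success only after the index has accepted the document. */
    static class SynchronousSender {
        private final FakeIndex index;
        SynchronousSender(FakeIndex index) { this.index = index; }

        boolean send(String docId) {
            index.post(docId);
            return true;
        }
    }

    /**
     * Simulates a crawler that records each acknowledged document as done and
     * then stops abruptly (flush() is never called). Returns the documents that
     * were recorded as done but never reached the index.
     */
    static List<String> lostAfterAbruptStop(boolean buffered, List<String> docIds) {
        FakeIndex index = new FakeIndex();
        BufferingSender bufferingSender = new BufferingSender(index);
        SynchronousSender synchronousSender = new SynchronousSender(index);
        List<String> recordedDone = new ArrayList<>();
        for (String id : docIds) {
            boolean ok = buffered ? bufferingSender.send(id) : synchronousSender.send(id);
            if (ok) recordedDone.add(id);
        }
        // The process stops here without flushing any buffer.
        List<String> lost = new ArrayList<>(recordedDone);
        lost.removeAll(index.received);
        return lost;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("doc1", "doc2", "doc3");
        System.out.println("buffered, lost: " + lostAfterAbruptStop(true, ids));
        System.out.println("synchronous, lost: " + lostAfterAbruptStop(false, ids));
    }
}
```

In this toy model the buffering sender loses every document it acknowledged, while the synchronous sender loses none. A real client would flush periodically, so the loss would be bounded by the buffer size rather than total, but an incremental crawler that has already marked those documents as done would not know to resend them.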

        Jan Høydahl added a comment -

        Yea, guess the net effect is about the same if MCF handles the threads or SolrJ does. Guess we could set threadCount=1 and make buffer size configurable. The point of switching to SolrJ would be the assumption that code is more stable and performant. Also SOLR-1565 could make things even faster.

        Karl Wright added a comment -

        That's why this ticket was created - to explore using solrj instead of the homegrown code currently in the connector. However, there are issues we need to consider before solrj would be an option. The guaranteed delivery problem is one such. But also if SolrJ spins up its own threads it might well make it difficult to shut ManifoldCF down properly, depending on how those threads are created. Just as it is better to use an application server's thread pool when you are a web application, the same principles apply for threads created by connectors and their supporting libraries. If you have access to ManifoldCF in Action, you might want to have a look at chapters 5 and 6 for details.

        However, that does not rule solrj out; it just means we need to be cautious if and when the Solr connector is transitioned to use it. If you want to explore this in detail, by all means feel free - patches are definitely welcome.
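
The thread-ownership point can be sketched in plain Java. This is only an illustration under stated assumptions: `Indexer` and `runAndShutDown` are hypothetical names, not part of ManifoldCF or SolrJ, and the "post" is a counter increment standing in for an HTTP request. The idea shown is that a supporting library accepts an `ExecutorService` from its host instead of creating its own threads, so the host keeps control over shutdown.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: class and method names are illustrative only.
class HostOwnedThreadsSketch {

    /** A library-style component that runs its work on threads supplied by the host. */
    static class Indexer {
        private final ExecutorService executor;   // owned and shut down by the host
        final AtomicInteger delivered = new AtomicInteger();

        Indexer(ExecutorService hostExecutor) { this.executor = hostExecutor; }

        void submit(String docId) {
            // Stand-in for posting docId to the index.
            executor.execute(() -> delivered.incrementAndGet());
        }
    }

    /**
     * Submits docCount documents, then shuts the host pool down.
     * Returns true if every worker finished in time and all documents were handled.
     */
    static boolean runAndShutDown(int docCount) {
        ExecutorService hostPool = Executors.newFixedThreadPool(4);
        Indexer indexer = new Indexer(hostPool);
        for (int i = 0; i < docCount; i++) {
            indexer.submit("doc" + i);
        }
        hostPool.shutdown();   // the host decides when workers stop accepting work
        try {
            boolean terminated = hostPool.awaitTermination(5, TimeUnit.SECONDS);
            return terminated && indexer.delivered.get() == docCount;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("clean shutdown: " + runAndShutDown(100));
    }
}
```

Because the pool belongs to the host, `shutdown()` followed by `awaitTermination` gives a definite answer about whether all in-flight work completed. A library that starts its own non-daemon threads offers no such handle unless it exposes an explicit shutdown call of its own.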

        Karl Wright added a comment -

        Resolved by CONNECTORS-594.


          People

          • Assignee: Karl Wright
          • Reporter: Karl Wright
          • Votes: 1
          • Watchers: 2
