Nutch / NUTCH-1480

SolrIndexer to write to multiple servers.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: indexer
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      SolrUtils should return an array of SolrServers and read the SolrUrl as a comma-delimited list of URLs using Configuration.getStrings(). SolrWriter should be able to handle this list of SolrServers.
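      A minimal sketch, in plain Java, of how the comma-delimited URL handling described above could look. The class and method names are illustrative only, not the actual patch; Hadoop's Configuration.getStrings() performs an equivalent split internally.

```java
import java.util.Arrays;

// Illustrative sketch only: splits a comma-delimited solr.server.url value
// into one URL per server, the way Hadoop's Configuration.getStrings() would.
public class SolrUrlListSketch {

  public static String[] parseSolrUrls(String value) {
    if (value == null || value.trim().isEmpty()) {
      return new String[0];
    }
    return Arrays.stream(value.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .toArray(String[]::new);
  }
}
```

      SolrWriter would then open one SolrServer per returned URL.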

      This is useful if you want to send documents to multiple servers if no replication is available or if you want to send documents to multiple NOCs.

      edit:
      This does not replace NUTCH-1377 but complements it. With NUTCH-1377 this issue allows you to index to multiple SolrCloud clusters at the same time.

        Issue Links

          Activity

          markus17 Markus Jelsma added a comment -

          Patch for 1.6 allows the SolrUrl parameter to be a comma-delimited list of URLs. This works for the indexer and clean tools, not the deduplicator. It also slightly improves logging: one can now see how many docs were added or removed.

          Any comments?

          lewismc Lewis John McGibbney added a comment -

          Hi Markus. Can I run multiple Solr servers in pseudo-distributed mode? If so, can you provide a link to a (Solr) wiki entry and I will try this out. Also, can you clarify the difference between cores and servers? I don't see the server terminology listed [0] on the Solr wiki. Thanks
          [0] http://wiki.apache.org/solr/SolrTerminology

          jnioche Julien Nioche added a comment -

          Hi Lewis

          Can I run multiple Solr servers in pseudo-distributed mode?

          SOLR is completely separate from Hadoop and has nothing to do with local vs. distributed mode. You can run several instances of SOLR on the same machine, if that is your question. Just specify a different port when starting it from the command line, with a separate SOLR home.

          Markus,

          Just to make sure I understand - this sends ALL the documents to ALL the SOLR instances specified, right?

          markus17 Markus Jelsma added a comment -

          Hi Lewis, Solr does not have a notion of a pseudo-distributed mode. You can simply run one server with two cores, two servers on different ports or different machines, or have a SolrCloud cluster in multiple NOCs; all have a unique URL. A SolrCloud cluster is roughly the same as multiple stand-alone servers working together.

          A core is similar to a table in a SQL server.

          See http://wiki.apache.org/solr/CoreAdmin#Example on how to quickly set up two cores in a single Solr server. Copy Nutch's schema and you have two Nutch indices you can write to in a single go.

          markus17 Markus Jelsma added a comment -

          Julien, yes. The Nutch tools are modified to work with a List<SolrServer> now. One or more SolrServers are returned depending on how many you specify comma-separated. Communication to Solr is simply done by iterating over the List and repeating the commands such as add or delete or commit.
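          The fan-out described here can be sketched with a stand-in interface. SolrServerStub below is a hypothetical placeholder for SolrJ's client class, used only so the sketch is self-contained; the real patch iterates over a List<SolrServer>.

```java
import java.util.List;

// Sketch of replicating every command to ALL configured servers.
// SolrServerStub stands in for SolrJ's SolrServer; illustrative only.
public class FanOutSketch {

  public interface SolrServerStub {
    void add(String doc);
    void commit();
  }

  // Simple counting implementation, useful for demonstration.
  public static class CountingServer implements SolrServerStub {
    public int added = 0;
    public int commits = 0;
    public void add(String doc) { added++; }
    public void commit() { commits++; }
  }

  // The same add is repeated against each server in the list.
  public static void addToAll(List<SolrServerStub> servers, String doc) {
    for (SolrServerStub s : servers) {
      s.add(doc);
    }
  }

  public static void commitAll(List<SolrServerStub> servers) {
    for (SolrServerStub s : servers) {
      s.commit();
    }
  }
}
```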

          jnioche Julien Nioche added a comment -

          OK thanks. What about having a mechanism for specifying a way of distributing the docs with the replicate-to-all being one of the options? Could do consistent hashing maybe? I expect that most people would want to shard.
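          The consistent-hashing option floated here could look roughly like this toy ring built on java.util.TreeMap. This is purely an assumption about what such a partitioner might do, not existing Nutch code; virtual nodes and rebalancing are omitted for brevity.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hash ring: each server is placed on a ring by hash,
// and a document goes to the first server clockwise from its own hash.
public class ConsistentHashSketch {

  private final TreeMap<Integer, String> ring = new TreeMap<>();

  public void addServer(String url) {
    ring.put(url.hashCode() & 0x7fffffff, url);
  }

  public String serverFor(String docKey) {
    int h = docKey.hashCode() & 0x7fffffff;
    SortedMap<Integer, String> tail = ring.tailMap(h);
    // Wrap around to the first server if we fall past the last position.
    return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
  }
}
```

          The appeal for sharding is that adding or removing one server only remaps the keys between its ring neighbors.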

          Off topic, re: deduplication: I think we've hit the limits of the current mechanism, which I assume was based on the one we had when Nutch was managing its own Lucene indices. It's not reasonable to pump ALL the docs from SOLR into Hadoop to dedup; I'd rather have MapReduce jobs find the duplicates based on the crawldb and send the deletion commands to SOLR. And this would work for ElasticSearch as well. I'm pretty sure there is a JIRA for this somewhere.

          markus17 Markus Jelsma added a comment -

          I think you mean the just-linked issue NUTCH-1377? I happen to be working on that issue. Using the CloudSolrServer will send the docs to the correct shard already. The way it works now is sending millions of records to a single node which then distributes them again, a waste of IO. That issue will work with this issue, so I may want to push them in together. It's just a matter of returning the correct SolrServer instance and working around the HttpClient issues.

          Agreed on deduplication. It would also be hard for this issue to work with dedup, because not all indices may be identical.

          jnioche Julien Nioche added a comment -

          Nope. I meant implementing the distribution to the shards on the Nutch side without relying on the CloudSolrServer. Having said that, we want to move to SOLR4, and if we get that from SOLR for cheap then that's even better.

          lewismc Lewis John McGibbney added a comment -

          Mmm. Additionally, there is a problem with this when you run a continuous or large crawl with the crawl script. By default it attempts to dedup, and the job fails. Fundamentally though, the patch works with trunk (and Solr 4.0) as you describe... after I've applied the patch in NUTCH-1486.

          markus17 Markus Jelsma added a comment -

          Julien, yes, doing it via SolrJ and ZooKeeper is much cheaper. We don't need to obtain hash ranges ourselves.

          Lewis, I didn't think about that. That's a problem indeed. Although it is technically possible, we must not dedup with two indices at the same time; it's going to be trouble. Even with this patch it will still work until someone decides to use the crawl command with two SolrServers comma-separated. Although I don't believe users of the crawl command will write to multiple locations, we must prevent them from doing so. Perhaps a simple check for a comma in the Solr URL, and an exception?
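          The comma-check guard proposed here would be tiny; a hypothetical sketch (not part of any attached patch):

```java
// Hypothetical guard for the crawl command: refuse a comma-separated
// Solr URL list so dedup never runs against multiple indices.
public class CrawlGuardSketch {

  public static void rejectMultipleServers(String solrUrl) {
    if (solrUrl != null && solrUrl.contains(",")) {
      throw new IllegalArgumentException(
          "The crawl command does not support multiple Solr servers: " + solrUrl);
    }
  }
}
```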

          Something else? Drop deduplication altogether and rely on Solr's internal deduplication (which doesn't work for clouds though)? Other clever ideas?

          lewismc Lewis John McGibbney added a comment -

          A suggestion for the dedup job is to read in the comma-separated Solr cores/servers, then drop everything after the first server in the list. Some simple logging could then indicate that dedup was only attempted and subsequently executed on the first server...
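          In code, the suggestion amounts to something like this (illustrative helper only; a real dedup job would log through its own logger):

```java
// Keep only the first server of a comma-separated list for the dedup job.
public class DedupFirstSketch {

  public static String firstServerOnly(String solrUrls) {
    String first = solrUrls.split(",")[0].trim();
    // A real implementation would log here, e.g.:
    // LOG.info("Deduplication runs against the first server only: " + first);
    return first;
  }
}
```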

          markus17 Markus Jelsma added a comment -

          I'm fine with that too; I don't really care as long as it's obvious enough for the user. Any more thoughts to share? Opinions?

          jnioche Julien Nioche added a comment -

          I'd rather it was implemented as an extension of NUTCH-945 where we'd have a partitioner that sends to all SOLR instances, which is I believe what NUTCH-1480 is about. There are many cases where we'd want to shard according to other criteria and NUTCH-945 would provide a more generic framework. Does this make sense?

          markus17 Markus Jelsma added a comment -

          We leave all the shard key hashing up to SolrCloud and the Cloud-aware SolrJ client we use, see NUTCH-1377. We use both patches and provide two lists of ZooKeeper addresses to index to multiple clouds at the same time. As far as I am concerned, NUTCH-945 is obsolete due to SolrCloud handling this automatically.

          jnioche Julien Nioche added a comment -

          Probably depends on whether we want to support both SOLR 3.x and SOLR 4.x. Got your point about indexing to multiple clouds, thanks!

          markus17 Markus Jelsma added a comment -

          I think we should update to 4.0 in 1.7 (can't find the relevant issue). 4.x can work just as 3.x in some stand-alone mode, and NUTCH-1377 still allows indexing to a single node (or multiple single nodes with this patch included).

          SolrJ 4.x should work fine with 3.x series servers, as IIRC no javabin changes have been made recently.

          kaveh kaveh minooie added a comment -

          I just found this issue today when I was checking to see if what I am about to upload would be a duplicate issue or not, and good thing I did, since apparently there are quite a few issues about this. But considering that this is the latest one, I will post it here.

          This patch adds another plugin, indexer-solrshard, that allows sharding the index data on the Nutch side. This is mostly geared toward Solr 3.x, as there are still a few of them around (including in our production environment), but it could have benefits even with Solr 4.x, which I will get to.

          It adds two new properties to the Nutch config file (solr.shardkey and solr.server.urls). solr.shardkey would be the name of the field that should be used to generate the hash code (and if it is being used against Solr 3.x, it should be the uniqueKey field in the schema file, otherwise deletes would not work properly), and solr.server.urls would be a comma-separated list of Solr core URLs or instance URLs. The plugin divides the hash value by the number of URLs to figure out in which core it should put the document. It also uses the rest of the Solr properties (commit size, etc.); the code is really the same. The idea behind having a solr.server.urls instead of just using solr.server.url was so that both plugins could be used simultaneously, which can help in migrating from 3.x to 4.x as well, though I guess the same argument can be made for other properties.

          The code uses the String.hashCode function, which is really good enough in terms of evenly distributing docs across multiple cores (in our case, with about 85 million docs over 8 cores, the difference between the number of docs in each core is less than 5%), but changing the hash function, or even making it customizable as was suggested in NUTCH-945, is trivial.
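          The shard selection described above can be sketched as follows. This is a guess at the mechanism from the description, not the patch itself; note that String.hashCode() can be negative, so the sign bit is masked off before the modulo.

```java
// Pick a core index from the shard-key field's hash, as described above.
public class ShardSelectSketch {

  public static int shardFor(String shardKey, int numShards) {
    return (shardKey.hashCode() & 0x7fffffff) % numShards;
  }
}
```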

          Turning the hashing mechanism off is also trivial (again, I didn't know about this issue when I was writing this, otherwise I would have done it already), but we can add another property such as solr.usehash and, by setting it to false, have the plugin just post the documents to all the servers, which could also be quite useful.

          As for using it against Solr 4.x, it can function as a load balancer. Believe me when I say watching 40 reduce jobs try to write to a single Solr instance is rather horrifying.

          The patch is against trunk, but porting it to 2.x is trivial (I actually think it can probably be applied as-is, but I haven't tested it yet).

          roannel Roannel Fernández Hernández added a comment -

          I'm testing a solution which uses this file [1] to configure the index writers. In this XML file, we can put into every "writer" tag the parameters used by the writer and a mapping for every field of the Nutch documents. With this new way of using the writers in Nutch, we can have as many field mappings as we need, not only for the Solr index writer, but for every index writer that we have. We will also be able to define different configurations for index writers, even for the same IndexWriter class. This solution applies to all types of index writers, not just the Solr index writer.

          The structure of [1] is described in [2].

          [1] https://github.com/r0ann3l/nutch/blob/NUTCH-1480/conf/index-writers.xml.template
          [2] https://github.com/r0ann3l/nutch/blob/NUTCH-1480/conf/index-writers.xsd

          githubbot ASF GitHub Bot added a comment -

          r0ann3l opened a new pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218

          With this patch we can now have many instances of the same IndexWriter class, but with different configurations. Also, we can copy, rename or remove fields of documents for every index writer individually. Besides, the parameters needed by the index writers will be in separate XML files, so they will no longer be in nutch-site.xml.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          jorgelbg commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#issuecomment-325707916

          This PR includes more changes than the original ticket and breaks BC with custom indexers. @r0ann3l could you squash all the changes into a single commit? That would help in the review process, since this PR has a lot of changes.


          githubbot ASF GitHub Bot added a comment -

          odisleysi commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#issuecomment-327018446

          I like this idea. I work on a project that needs to save documents in Solr for searching and in Elasticsearch for statistics. This solves the problem.


          jorgelbg Jorge Luis Betancourt Gonzalez added a comment -

          Markus Jelsma, do you mind taking a look at the linked PR? I think the PR covers more than the original intent of this issue; since you've already worked on something similar, I think your input would be really valuable in this case.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r137120636

          File path: src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java

          @@ -17,28 +17,26 @@
           package org.apache.nutch.indexwriter.rabbit;

           interface RabbitMQConstants {

          -  String RABBIT_PREFIX = "rabbitmq.indexer";

          Review comment:
          Why did you remove all of these?


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r137121058

          ##########
          File path: src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
          ##########
          @@ -75,19 +70,61 @@
          private int totalUpdates = 0;
          private boolean delete = false;

+  @Override
   public void open(JobConf job, String name) throws IOException {
-    solrClients = SolrUtils.getSolrClients(job);
-    init(solrClients, job);
+    //Implementation not required
   }

-  // package protected for tests
-  void init(List<SolrClient> solrClients, JobConf job) throws IOException {
-    batchSize = job.getInt(SolrConstants.COMMIT_SIZE, 1000);
-    solrMapping = SolrMappingReader.getInstance(job);
-    delete = job.getBoolean(IndexerMapReduce.INDEXER_DELETE, false);
+  /**
+   * Initializes the internal variables from a given index writer configuration.
+   *
+   * @param parameters Params from the index writer configuration.
+   * @throws IOException Some exception thrown by writer.
+   */
+  @Override
+  public void open(Map<String, String> parameters) throws IOException {
+    String type = parameters.getOrDefault("type", "http");
+
+    String[] urls = StringUtils.getStrings(parameters.get("url"));
+
+    if (urls == null) {
+      String message = "Missing SOLR URL.\n" + describe();
+      LOG.error(message);
+      throw new RuntimeException(message);
+    }
+
+    this.solrClients = new ArrayList<>();
+
+    switch (type) {
+    case "http":
+      for (String url : urls) {
+        solrClients.add(SolrUtils.getHttpSolrClient(url));
+      }
+      break;
+    case "cloud":
+      for (String url : urls) {
+        CloudSolrClient sc = SolrUtils.getCloudSolrClient(url);
+        sc.setDefaultCollection(parameters.get(SolrConstants.COLLECTION));
+        solrClients.add(sc);
+      }
+      break;
+    case "concurrent":

          Review comment:
Can you throw an UnsupportedOperationException at this stage? And also add a default case?
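The suggestion above could look like the following minimal sketch: fail fast for the unimplemented "concurrent" type and for unknown types, instead of silently creating no Solr clients. The type names mirror the diff; the method name and exception messages are assumptions for illustration only.

```java
// Hypothetical sketch of the reviewer's suggestion; not the actual patch code.
public class SolrClientTypeCheck {

    static void checkType(String type) {
        switch (type) {
        case "http":
        case "cloud":
            break; // supported: clients are created as in the patch
        case "concurrent":
            // not yet implemented in the patch, so reject it explicitly
            throw new UnsupportedOperationException(
                "The type \"concurrent\" is not yet supported.");
        default:
            // default case catches typos and unknown types
            throw new IllegalArgumentException(
                "Unknown Solr client type: " + type);
        }
    }

    public static void main(String[] args) {
        checkType("http"); // supported, no exception
        try {
            checkType("concurrent");
        } catch (UnsupportedOperationException e) {
            System.out.println("concurrent rejected: " + e.getMessage());
        }
        try {
            checkType("bogus");
        } catch (IllegalArgumentException e) {
            System.out.println("default case hit: " + e.getMessage());
        }
    }
}
```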


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r137119842

          ##########
          File path: src/java/org/apache/nutch/indexer/IndexWriterConfig.java
          ##########
          @@ -0,0 +1,95 @@
          +package org.apache.nutch.indexer;
          +
          +import org.w3c.dom.Element;
          +import org.w3c.dom.NodeList;
          +
          +import java.util.*;

          Review comment:
          Please use explicit imports instead of wildcard
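A small self-contained illustration of the explicit-import style the reviewer asks for: each dependency is named, so readers and tooling can see exactly which `java.util` types the file uses (the class below is a made-up demo, not part of the patch).

```java
// Explicit imports instead of "import java.util.*"
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExplicitImportsDemo {
    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        urls.add("http://localhost:8983/solr");
        Map<String, String> params = new HashMap<>();
        params.put("type", "http");
        System.out.println(urls.size() + " " + params.get("type"));
    }
}
```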


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r137120319

          ##########
          File path: src/java/org/apache/nutch/indexer/MappingReader.java
          ##########
          @@ -0,0 +1,93 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +package org.apache.nutch.indexer;
          +
          +import org.w3c.dom.Element;
          +import org.w3c.dom.Node;
          +import org.w3c.dom.NodeList;
          +
          +import java.util.*;

          Review comment:
          Do not use wildcard


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r137119767

          ##########
          File path: src/java/org/apache/nutch/indexer/IndexWriterConfig.java
          ##########
          @@ -0,0 +1,95 @@
          +package org.apache.nutch.indexer;

          Review comment:
          Please add license header


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r137120181

          ##########
          File path: src/java/org/apache/nutch/indexer/IndexWriters.java
          ##########
          @@ -16,132 +16,245 @@
          */
          package org.apache.nutch.indexer;

          -import java.io.IOException;
          -import java.lang.invoke.MethodHandles;
          -import java.util.HashMap;
          -
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapred.JobConf;
          -import org.apache.nutch.indexer.NutchDocument;
          import org.apache.nutch.plugin.Extension;
          import org.apache.nutch.plugin.ExtensionPoint;
          import org.apache.nutch.plugin.PluginRepository;
          import org.apache.nutch.plugin.PluginRuntimeException;
          import org.apache.nutch.util.ObjectCache;
          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;
          +import org.w3c.dom.Document;
          +import org.w3c.dom.Element;
          +import org.w3c.dom.NodeList;
          +import org.xml.sax.InputSource;
          +import org.xml.sax.SAXException;
          +
          +import javax.xml.parsers.DocumentBuilder;
          +import javax.xml.parsers.DocumentBuilderFactory;
          +import javax.xml.parsers.ParserConfigurationException;
          +import java.io.IOException;
          +import java.io.InputStream;
          +import java.lang.invoke.MethodHandles;
          +import java.util.HashMap;
          +import java.util.List;
          +import java.util.Map;

-/** Creates and caches {@link IndexWriter} implementing plugins. */
+/**
+ * Creates and caches {@link IndexWriter} implementing plugins.
+ */
 public class IndexWriters {

   private static final Logger LOG = LoggerFactory
-      .getLogger(MethodHandles.lookup().lookupClass());
+      .getLogger(MethodHandles.lookup().lookupClass());

-  private IndexWriter[] indexWriters;
+  private HashMap<String, IndexWriterWrapper> indexWriters;

   public IndexWriters(Configuration conf) {
     ObjectCache objectCache = ObjectCache.get(conf);
+
     synchronized (objectCache) {
-      this.indexWriters = (IndexWriter[]) objectCache
-          .getObject(IndexWriter.class.getName());
+      this.indexWriters = (HashMap<String, IndexWriterWrapper>) objectCache
+          .getObject(IndexWriterWrapper.class.getName());
+
+      //It's not cached yet
       if (this.indexWriters == null) {
         try {
           ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(
-              IndexWriter.X_POINT_ID);
-          if (point == null)
-            throw new RuntimeException(IndexWriter.X_POINT_ID + " not found.");
+              IndexWriter.X_POINT_ID);
+
+          if (point == null) {
+            throw new RuntimeException(IndexWriter.X_POINT_ID + " not found.");
+          }
+
           Extension[] extensions = point.getExtensions();
-          HashMap<String, IndexWriter> indexerMap = new HashMap<>();
-          for (int i = 0; i < extensions.length; i++) {
-            Extension extension = extensions[i];
-            IndexWriter writer = (IndexWriter) extension.getExtensionInstance();
-            LOG.info("Adding " + writer.getClass().getName());
-            if (!indexerMap.containsKey(writer.getClass().getName())) {
-              indexerMap.put(writer.getClass().getName(), writer);
+
+          HashMap<String, Extension> extensionMap = new HashMap<>();
+          for (Extension extension : extensions) {
+            LOG.info("Index writer " + extension.getClazz() + " identified.");

          Review comment:
          Please use parameterized logging
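With SLF4J, the parameterized form the reviewer asks for would be `LOG.info("Index writer {} identified.", extension.getClazz())`: the message is only assembled when the log level is enabled. The stand-in below mimics SLF4J's `{}` placeholder substitution in plain Java purely for illustration; it is not the SLF4J implementation.

```java
// Demonstrates the "{}" placeholder style behind parameterized logging.
// The format() helper is a simplified stand-in for SLF4J's MessageFormatter.
public class ParameterizedLoggingDemo {

    static String format(String template, Object... args) {
        StringBuilder sb = new StringBuilder();
        int cursor = 0;
        for (Object arg : args) {
            int idx = template.indexOf("{}", cursor);
            if (idx < 0) {
                break; // more args than placeholders: ignore the extras
            }
            // copy the literal text before the placeholder, then the argument
            sb.append(template, cursor, idx).append(arg);
            cursor = idx + 2; // skip past "{}"
        }
        sb.append(template.substring(cursor)); // trailing literal text
        return sb.toString();
    }

    public static void main(String[] args) {
        // Equivalent of: LOG.info("Index writer {} identified.", clazz)
        String msg = format("Index writer {} identified.",
            "org.apache.nutch.indexwriter.solr.SolrIndexWriter");
        System.out.println(msg);
    }
}
```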


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r137120754

          ##########
          File path: src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
          ##########
          @@ -19,19 +19,23 @@
          public interface SolrConstants {
          public static final String SOLR_PREFIX = "solr.";

          Review comment:
          Any reason to remove all of these as well?


          githubbot ASF GitHub Bot added a comment -

          jorgelbg commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r137141458

          ##########
          File path: src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
          ##########
          @@ -31,164 +31,182 @@
          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;

          +import java.lang.invoke.MethodHandles;
          import java.util.*;
          import java.util.concurrent.TimeoutException;

          public class RabbitIndexWriter implements IndexWriter {

-  private String serverHost;
-  private int serverPort;
-  private String serverVirtualHost;
-  private String serverUsername;
-  private String serverPassword;
+  private String serverHost;
+  private int serverPort;
+  private String serverVirtualHost;
+  private String serverUsername;
+  private String serverPassword;

-  private String exchangeServer;
-  private String exchangeType;
+  private String exchangeServer;
+  private String exchangeType;

-  private String queueName;
-  private boolean queueDurable;
-  private String queueRoutingKey;
+  private String queueName;
+  private boolean queueDurable;
+  private String queueRoutingKey;

-  private int commitSize;
+  private int commitSize;

-  public static final Logger LOG = LoggerFactory.getLogger(RabbitIndexWriter.class);
+  private static final Logger LOG = LoggerFactory
+      .getLogger(MethodHandles.lookup().lookupClass());

-  private Configuration config;
+  private Configuration config;

-  private RabbitMessage rabbitMessage = new RabbitMessage();
+  private RabbitMessage rabbitMessage = new RabbitMessage();

-  private Channel channel;
-  private Connection connection;
+  private Channel channel;
+  private Connection connection;

-  @Override
-  public Configuration getConf() {
-    return config;
-  }
+  @Override
+  public Configuration getConf() {
+    return config;
+  }

-  @Override
-  public void setConf(Configuration conf) {
-    config = conf;
+  @Override
+  public void setConf(Configuration conf) {
+    config = conf;
+  }

-    serverHost = conf.get(RabbitMQConstants.SERVER_HOST, "localhost");
-    serverPort = conf.getInt(RabbitMQConstants.SERVER_PORT, 15672);
-    serverVirtualHost = conf.get(RabbitMQConstants.SERVER_VIRTUAL_HOST, null);
+  @Override
+  public void open(JobConf JobConf, String name) throws IOException {
+    //Implementation not required
+  }

-    serverUsername = conf.get(RabbitMQConstants.SERVER_USERNAME, "admin");
-    serverPassword = conf.get(RabbitMQConstants.SERVER_PASSWORD, "admin");
+  /**
+   * Initializes the internal variables from a given index writer configuration.
+   *
+   * @param parameters Params from the index writer configuration.
+   * @throws IOException Some exception thrown by writer.
+   */
+  @Override
+  public void open(Map<String, String> parameters) throws IOException {
+    serverHost = parameters.getOrDefault(RabbitMQConstants.SERVER_HOST, "localhost");
+    serverPort = Integer.parseInt(parameters.getOrDefault(RabbitMQConstants.SERVER_PORT, "5672"));
+    serverVirtualHost = parameters.getOrDefault(RabbitMQConstants.SERVER_VIRTUAL_HOST, null);

-    exchangeServer = conf.get(RabbitMQConstants.EXCHANGE_SERVER, "nutch.exchange");
-    exchangeType = conf.get(RabbitMQConstants.EXCHANGE_TYPE, "direct");
+    serverUsername = parameters.getOrDefault(RabbitMQConstants.SERVER_USERNAME, "admin");
+    serverPassword = parameters.getOrDefault(RabbitMQConstants.SERVER_PASSWORD, "admin");

-    queueName = conf.get(RabbitMQConstants.QUEUE_NAME, "nutch.queue");
-    queueDurable = conf.getBoolean(RabbitMQConstants.QUEUE_DURABLE, true);
-    queueRoutingKey = conf.get(RabbitMQConstants.QUEUE_ROUTING_KEY, "nutch.key");
+    exchangeServer = parameters.getOrDefault(RabbitMQConstants.EXCHANGE_SERVER, "nutch.exchange");
+    exchangeType = parameters.getOrDefault(RabbitMQConstants.EXCHANGE_TYPE, "direct");

-    commitSize = conf.getInt(RabbitMQConstants.COMMIT_SIZE, 250);
-  }
+    queueName = parameters.getOrDefault(RabbitMQConstants.QUEUE_NAME, "nutch.queue");
+    queueDurable = Boolean.parseBoolean(parameters.getOrDefault(RabbitMQConstants.QUEUE_DURABLE, "true"));
+    queueRoutingKey = parameters.getOrDefault(RabbitMQConstants.QUEUE_ROUTING_KEY, "nutch.key");

-  @Override
-  public void open(JobConf JobConf, String name) throws IOException {
-    ConnectionFactory factory = new ConnectionFactory();
-    factory.setHost(serverHost);
-    factory.setPort(serverPort);
+    commitSize = Integer.parseInt(parameters.getOrDefault(RabbitMQConstants.COMMIT_SIZE, "250"));

-    if (serverVirtualHost != null) {
-      factory.setVirtualHost(serverVirtualHost);
-    }
+    ConnectionFactory factory = new ConnectionFactory();
+    factory.setHost(serverHost);
+    factory.setPort(serverPort);

-    factory.setUsername(serverUsername);
-    factory.setPassword(serverPassword);
+    if (serverVirtualHost != null) {
+      factory.setVirtualHost(serverVirtualHost);
+    }

-    try {
-      connection = factory.newConnection();
-      channel = connection.createChannel();
+    factory.setUsername(serverUsername);
+    factory.setPassword(serverPassword);

-      channel.exchangeDeclare(exchangeServer, exchangeType, true);
-      channel.queueDeclare(queueName, queueDurable, false, false, null);
-      channel.queueBind(queueName, exchangeServer, queueRoutingKey);
+    try {
+      connection = factory.newConnection(UUID.randomUUID().toString());
+      channel = connection.createChannel();

-    } catch (TimeoutException | IOException ex) {
-      throw makeIOException(ex);
-    }
-  }
+      channel.exchangeDeclare(exchangeServer, exchangeType, true);
+      channel.queueDeclare(queueName, queueDurable, false, false, null);
+      channel.queueBind(queueName, exchangeServer, queueRoutingKey);

-  @Override
-  public void update(NutchDocument doc) throws IOException {
-    RabbitDocument rabbitDocument = new RabbitDocument();
-
-    for (final Map.Entry<String, NutchField> e : doc) {
-      RabbitDocument.RabbitDocumentField field = new RabbitDocument.RabbitDocumentField(
-          e.getKey(),
-          e.getValue().getWeight(),
-          e.getValue().getValues());
-      rabbitDocument.addField(field);
-    }
-    rabbitDocument.setDocumentBoost(doc.getWeight());
-
-    rabbitMessage.addDocToUpdate(rabbitDocument);
-    if (rabbitMessage.size() >= commitSize) {
-      commit();
-    }
-  }
+    } catch (TimeoutException | IOException ex) {
+      throw makeIOException(ex);
+    }
+  }

-  @Override
-  public void commit() throws IOException {
-    if (!rabbitMessage.isEmpty()) {
-      channel.basicPublish(exchangeServer, queueRoutingKey, null, rabbitMessage.getBytes());
-    }
-    rabbitMessage.clear();
-  }
+  @Override
+  public void update(NutchDocument doc) throws IOException {
+    RabbitDocument rabbitDocument = new RabbitDocument();
+
+    for (final Map.Entry<String, NutchField> e : doc) {
+      RabbitDocument.RabbitDocumentField field = new RabbitDocument.RabbitDocumentField(
+          e.getKey(),
+          e.getValue().getWeight(),
+          e.getValue().getValues());
+      rabbitDocument.addField(field);
+    }
+    rabbitDocument.setDocumentBoost(doc.getWeight());

-  @Override
-  public void write(NutchDocument doc) throws IOException {
-    RabbitDocument rabbitDocument = new RabbitDocument();
-
-    for (final Map.Entry<String, NutchField> e : doc) {
-      RabbitDocument.RabbitDocumentField field = new RabbitDocument.RabbitDocumentField(
-          e.getKey(),
-          e.getValue().getWeight(),
-          e.getValue().getValues());
-      rabbitDocument.addField(field);
-    }
-    rabbitDocument.setDocumentBoost(doc.getWeight());
-
-    rabbitMessage.addDocToWrite(rabbitDocument);
-
-    if (rabbitMessage.size() >= commitSize) {
-      commit();
-    }
-  }
+    rabbitMessage.addDocToUpdate(rabbitDocument);
+    if (rabbitMessage.size() >= commitSize) {
+      commit();
+    }
+  }

-  @Override
-  public void close() throws IOException {
-    commit();//TODO: This is because indexing job never call commit method. It should be fixed.
-    try {
-      channel.close();
-      connection.close();
-    } catch (IOException | TimeoutException e) {
-      throw makeIOException(e);
-    }
-  }
+  @Override
+  public void commit() throws IOException {
+    if (!rabbitMessage.isEmpty()) {
+      channel.basicPublish(exchangeServer, queueRoutingKey, null, rabbitMessage.getBytes());
+    }
+    rabbitMessage.clear();
+  }
+
+  @Override
+  public void write(NutchDocument doc) throws IOException {
+    RabbitDocument rabbitDocument = new RabbitDocument();
+
+    for (final Map.Entry<String, NutchField> e : doc) {
+      RabbitDocument.RabbitDocumentField field = new RabbitDocument.RabbitDocumentField(
+          e.getKey(),
+          e.getValue().getWeight(),
+          e.getValue().getValues());
+      rabbitDocument.addField(field);
+    }
+    rabbitDocument.setDocumentBoost(doc.getWeight());

-  @Override
-  public void delete(String url) throws IOException {
-    rabbitMessage.addDocToDelete(url);
+    rabbitMessage.addDocToWrite(rabbitDocument);

-    if (rabbitMessage.size() >= commitSize) {
-      commit();
-    }
-  }
+    if (rabbitMessage.size() >= commitSize) {
+      commit();
+    }
+  }

-  private static IOException makeIOException(Exception e) {
-    return new IOException(e);
-  }
+  @Override
+  public void close() throws IOException {
+    commit();//TODO: This is because indexing job never call commit method. It should be fixed.
+    try {
+      if (channel.isOpen()) {
+        channel.close();
+      }
+      if (connection.isOpen()) {
+        connection.close();
+      }
+    } catch (IOException | TimeoutException e) {
+      throw makeIOException(e);
+    }
+  }
+
+  @Override
+  public void delete(String url) throws IOException {
+    rabbitMessage.addDocToDelete(url);

-  public String describe() {
-    return "RabbitIndexWriter\n" +
-        "\t" + serverHost + ":" + serverPort + " : URL of RabbitMQ server\n" +
-        "\t" + RabbitMQConstants.SERVER_VIRTUAL_HOST + " : Virtualhost name\n" +
-        "\t" + RabbitMQConstants.SERVER_USERNAME + " : Username for authentication\n" +
-        "\t" + RabbitMQConstants.SERVER_PASSWORD + " : Password for authentication\n" +
-        "\t" + RabbitMQConstants.COMMIT_SIZE + " : Buffer size when sending to RabbitMQ (default 250)\n";
-  }
+    if (rabbitMessage.size() >= commitSize) {
+      commit();
+    }
+  }
+
+  private static IOException makeIOException(Exception e) {
+    return new IOException(e);
+  }
+
+  public String describe() {
+    return "RabbitIndexWriter\n" +
+        "\t" + RabbitMQConstants.SERVER_HOST + " : Host of RabbitMQ server\n" +
+        "\t" + RabbitMQConstants.SERVER_PORT + " : Port of RabbitMQ server\n" +
+        "\t" + RabbitMQConstants.SERVER_VIRTUAL_HOST + " : Virtualhost name\n" +
+        "\t" + RabbitMQConstants.SERVER_USERNAME + " : Username for authentication\n" +
+        "\t" + RabbitMQConstants.SERVER_PASSWORD + " : Password for authentication\n" +
+        "\t" + RabbitMQConstants.COMMIT_SIZE + " : Buffer size when sending to RabbitMQ (default 250)\n";
+  }

          Review comment:
          For consistency probably use the `StringBuilder` pattern here?
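To make the suggestion concrete, here is a minimal, self-contained sketch of `describe()` rebuilt with the `StringBuilder` pattern. The constant fields below are hypothetical stand-ins for the `RabbitMQConstants` keys (the key values follow the `index-writers.xml.template` parameter names), not the actual class:

```java
// Hypothetical sketch of the StringBuilder pattern suggested above.
// SERVER_HOST etc. are stand-ins for the real RabbitMQConstants fields.
public class DescribeSketch {
  static final String SERVER_HOST = "server.host";
  static final String SERVER_PORT = "server.port";
  static final String COMMIT_SIZE = "commit.size";

  public static String describe() {
    // One StringBuilder, chained appends instead of String concatenation.
    StringBuilder sb = new StringBuilder("RabbitIndexWriter\n");
    sb.append('\t').append(SERVER_HOST).append(" : Host of RabbitMQ server\n")
      .append('\t').append(SERVER_PORT).append(" : Port of RabbitMQ server\n")
      .append('\t').append(COMMIT_SIZE)
      .append(" : Buffer size when sending to RabbitMQ (default 250)\n");
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(describe());
  }
}
```

The output is identical to the concatenation version; the benefit is consistency with the rest of the codebase and avoiding intermediate `String` objects.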

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          jorgelbg commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r137142167

          ##########
          File path: conf/index-writers.xsd
          ##########
          @@ -0,0 +1,179 @@
          +<?xml version="1.0" encoding="UTF-8"?>
          +<!--
          + Licensed to the Apache Software Foundation (ASF) under one or more
          + contributor license agreements. See the NOTICE file distributed with
          + this work for additional information regarding copyright ownership.
          + The ASF licenses this file to You under the Apache License, Version 2.0
          + (the "License"); you may not use this file except in compliance with
          + the License. You may obtain a copy of the License at
          +
          + http://www.apache.org/licenses/LICENSE-2.0
          +
          + Unless required by applicable law or agreed to in writing, software
          + distributed under the License is distributed on an "AS IS" BASIS,
          + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + See the License for the specific language governing permissions and
          + limitations under the License.
          +-->
          +<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
          + targetNamespace="http://lucene.apache.org/nutch"
          + xmlns="http://lucene.apache.org/nutch"
          + elementFormDefault="qualified">
          + <xs:element name="writers">
          + <xs:annotation>
          + <xs:documentation>
          + Root tag of index-writers.xml document. It's a wrapper for the all index writers.
          + </xs:documentation>
          + </xs:annotation>
          + <xs:complexType>
          + <xs:sequence>
          + <xs:element name="writer" type="writerType" maxOccurs="unbounded" minOccurs="1">
          + <xs:annotation>
          + <xs:documentation>
          + Contains the all configuration of a particular index writer.

          Review comment:
          typo? Also it would be a good idea to have empty lines at the end of this file and the `conf/index-writers.xml.template` for git/diff compatibility.


          githubbot ASF GitHub Bot added a comment -

          r0ann3l commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r138659435

          ##########
          File path: src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java
          ##########
          @@ -17,28 +17,26 @@
          package org.apache.nutch.indexwriter.rabbit;

          interface RabbitMQConstants {

          - String RABBIT_PREFIX = "rabbitmq.indexer";

          Review comment:
          Hi @lewismc. The prefix is not necessary anymore. The new structure allows us to use the same parameter key for many index writers without ambiguity or confusion. The prefix only makes a parameter key longer, and I really do not believe it is necessary.
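For illustration, the point about dropping the prefix follows from the new `index-writers.xml` structure: each `<parameters>` block is scoped to its own `<writer>`, so two writer instances can reuse the same short key. A hedged sketch (hosts and ids here are hypothetical; the element structure follows `conf/index-writers.xml.template`):

```xml
<writer id="indexer_rabbit_1" class="org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter">
  <parameters>
    <!-- "server.host" needs no "rabbitmq.indexer" prefix: it is scoped to this writer -->
    <param name="server.host" value="rabbit-1.example.org"/>
  </parameters>
</writer>
<writer id="indexer_rabbit_2" class="org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter">
  <parameters>
    <param name="server.host" value="rabbit-2.example.org"/>
  </parameters>
</writer>
```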


          githubbot ASF GitHub Bot added a comment -

          r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#issuecomment-329219778

          Thanks @lewismc and @jorgelbg for your reviews. All your comments have been fixed.


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#issuecomment-329563051

          Tested this on Solr 6 and works well... any comments folks?


          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#issuecomment-330784463

          Looks good! I've tried to use indexer-dummy with this PR applied - it took a while to configure the index-writers.xml properly, so we should definitely add "stub" sections for all index writers which are (still) based on configuration properties. All index writers should work out-of-the-box!


          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
          URL: https://github.com/apache/nutch/pull/218#discussion_r139908742

          ##########
          File path: conf/index-writers.xml.template
          ##########
          @@ -0,0 +1,75 @@
          +<?xml version="1.0" encoding="UTF-8" ?>
          +<!--
          + Licensed to the Apache Software Foundation (ASF) under one or more
          + contributor license agreements. See the NOTICE file distributed with
          + this work for additional information regarding copyright ownership.
          + The ASF licenses this file to You under the Apache License, Version 2.0
          + (the "License"); you may not use this file except in compliance with
          + the License. You may obtain a copy of the License at
          +
          + http://www.apache.org/licenses/LICENSE-2.0
          +
          + Unless required by applicable law or agreed to in writing, software
          + distributed under the License is distributed on an "AS IS" BASIS,
          + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + See the License for the specific language governing permissions and
          + limitations under the License.
          +-->
          +<writers xmlns="http://lucene.apache.org/nutch"
          + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          + xsi:schemaLocation="http://lucene.apache.org/nutch index-writers.xsd">
          +
          + <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
          + <parameters>
          + <param name="type" value="http"/>
          + <param name="url" value="http://localhost:8983/solr/core_name"/>
          + <param name="commitSize" value="250"/>
          + <param name="commitIndex" value="true"/>
          + <param name="auth" value="false"/>
          + <param name="username" value="username"/>
          + <param name="password" value="password"/>
          + </parameters>
          + <mapping>
          + <copy>
          + <field source="content" dest="search"/>
          + <field source="title" dest="title,search"/>
          + </copy>
          + <rename>
          + <field source="metatag.description" dest="description"/>
          + <field source="metatag.keywords" dest="keywords"/>
          + </rename>
          + <remove>
          + <field source="segment"/>
          + </remove>
          + </mapping>
          + </writer>
          + <writer id="indexer_rabbit_1" class="org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter">
          + <parameters>
          + <param name="server.host" value="localhost"/>
          + <param name="server.port" value="5672"/>
          + <param name="server.virtualhost" value="nutch"/>
          + <param name="server.username" value="admin"/>
          + <param name="server.password" value="admin"/>
          + <param name="exchange.server" value="nutch.exchange"/>
          + <param name="exchange.type" value="direct"/>
          + <param name="queue.name" value="nutch.queue"/>
          + <param name="queue.durable" value="true"/>
          + <param name="queue.routingkey" value="nutch.key"/>
          + <param name="commit.size" value="250"/>
          + </parameters>
          + <mapping>
          + <copy>
          + <field source="title" dest="title,search"/>
          + </copy>
          + <rename>
          + <field source="metatag.description" dest="description"/>
          + <field source="metatag.keywords" dest="keywords"/>
          + </rename>
          + <remove>
          + <field source="content"/>
          + <field source="segment"/>
          + <field source="boost"/>
          + </remove>
          + </mapping>
          + </writer>

          Review comment:
          Add stub <writer> sections for all indexer-* plugins so that they work out-of-the-box without modifications of the index-writers.xml required, e.g. for indexer-dummy:
          ```
          <writer id="indexer_dummy_1" class="org.apache.nutch.indexwriter.dummy.DummyIndexWriter">
            <parameters>
              <param name="dummy-dummy" value="dummy-dummy"/>
            </parameters>
            <mapping>
              <copy>
                <field source="dummy-dummy" dest="dummy-dummy"/>
              </copy>
              <rename/>
              <remove/>
            </mapping>
          </writer>
          ```
          That's long for a dummy section, but the schema (index-writers.xsd) and the IndexWriters class require all the elements and attributes. Maybe it's better to "relax" the schema, make elements/attributes optional, and make IndexWriters not fail with NPEs.
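One hedged way to relax the schema along those lines: mark the child elements of the writer type optional via `minOccurs="0"`, so a stub `<writer>` validates with just its `id` and `class` attributes. The type names below (`writerType`, `parametersType`, `mappingType`) are assumptions about index-writers.xsd, shown only as a sketch:

```xml
<!-- Hypothetical relaxation of writerType: <parameters> and <mapping>
     become optional, so a minimal stub <writer id="..." class="..."/>
     still validates against the schema. -->
<xs:complexType name="writerType">
  <xs:sequence>
    <xs:element name="parameters" type="parametersType" minOccurs="0"/>
    <xs:element name="mapping" type="mappingType" minOccurs="0"/>
  </xs:sequence>
  <xs:attribute name="id" type="xs:string" use="required"/>
  <xs:attribute name="class" type="xs:string" use="required"/>
</xs:complexType>
```

IndexWriters would then also need null checks (or defaults) for the missing sections, as the comment notes.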



            People

            • Assignee:
              markus17 Markus Jelsma
              Reporter:
              markus17 Markus Jelsma
            • Votes:
              2 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:

                Development