Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7
    • Component/s: indexer
    • Labels:
    • Patch Info:
      Patch Available

      Description

      One possible feature would be to add a new endpoint for indexing backends and make the indexing pluggable. At the moment we are hardwired to SOLR, which is OK, but as other resources like ElasticSearch become more popular it would be better to handle this via plugins. I'm not sure about the name of the endpoint, though: we already have indexing-plugins (which are about generating the fields sent to the backends), and moreover the backends are not necessarily for indexing/searching but could be just an external storage, e.g. CouchDB. The term 'backend' on its own would be confusing in 2.0, as it could be taken to refer to the storage in GORA. 'indexing-backend' is the best name that has come to my mind so far; please suggest better ones.

      We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.
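The pluggable-backend idea could be sketched roughly as follows. This is illustrative only, not Nutch's actual API; the names IndexBackend and InMemoryBackend are made up. The point is that the indexer talks to a small interface and each backend (SOLR, ES, CouchDB, ...) lives behind it as a plugin.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pluggable-backend idea described above.
// All names here are illustrative, not Nutch's actual API.
interface IndexBackend {
    void add(Map<String, String> doc); // send a document to the backend
    void delete(String id);            // remove a document by id
}

// A toy backend that just keeps documents in memory,
// standing in for a real SOLR or ElasticSearch client.
class InMemoryBackend implements IndexBackend {
    final Map<String, Map<String, String>> store = new HashMap<>();
    public void add(Map<String, String> doc) { store.put(doc.get("id"), doc); }
    public void delete(String id) { store.remove(id); }
}
```

The generic indexing, deduplicating and cleaning jobs would then only depend on the interface, and each concrete backend could ship as its own plugin.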

      1. NUTCH-1047-1.x-final.patch
        90 kB
        Julien Nioche
      2. NUTCH-1047-1.x-v1.patch
        80 kB
        Julien Nioche
      3. NUTCH-1047-1.x-v2.patch
        68 kB
        Julien Nioche
      4. NUTCH-1047-1.x-v3.patch
        121 kB
        Julien Nioche
      5. NUTCH-1047-1.x-v4.patch
        85 kB
        Julien Nioche
      6. NUTCH-1047-1.x-v5.patch
        86 kB
        Julien Nioche

        Issue Links

          Activity

          lewismc Lewis John McGibbney added a comment -

          I think the suggestion of generic/example/template map/reduce jobs would be an excellent addition. This is a great idea. In my opinion it would reduce the barrier for entry to users inexperienced in setting up jobs.

          My interest in your last point is a question which I suppose is wide open to discussion. What end-points (generally speaking) are we going to support and formally represent as pluggable entities? What criteria do we make decisions based on?

          jnioche Julien Nioche added a comment -

          My interest in your last point is a question which I suppose is wide open to discussion. What end-points (generally speaking) are we going to support and formally represent as pluggable entities? What criteria do we make decisions based on?

          We'll simply port the existing SOLR indexing to the plugin-based architecture so that people can easily add the backends they need. If there is a widespread need for a specific backend then I suppose someone will contribute patches and it might get committed. It's not like we need to define which backends (not same as endpoints BTW) would be added etc... we are just giving people the possibility of simply adding theirs without having to do a dirty hack of the indexer.

          There is currently growing interest in ElasticSearch, and I know of at least one person who's modified the SOLR indexer to get it to work for ES. This would be a good candidate for inclusion; apart from that, let's see what people contribute.

          jnioche Julien Nioche added a comment -

          It would be nice to have a plugin implementing this endpoint to generate WARC files. There seem to be two different situations though: one where we send docs to servers (SOLR, ES) and one where we generate files. Do we need to handle deletions for the latter? I don't think so, but we would need to for the former.

          Any thoughts on this? Would it make sense to have 2 different endpoints or not?

          markus17 Markus Jelsma added a comment -

          Hi Julien,

          I'm not sure I get your point exactly, but if we don't generate WARC files we:

          • don't have to think about the problem you state
          • don't create an additional process between Nutch and a search engine

          If you'd need WARC files, for some reason, I'd rather have an endpoint for it just like for ES and Solr instead of using WARC files as an intermediate format.

          Does your suggestion imply: segment+crawldb > warc files > search engine?

          jnioche Julien Nioche added a comment -

          If you'd need WARC files, for some reason, I'd rather have an endpoint for it just like for ES and Solr instead of using WARC files as an intermediate format.

          Does your suggestion imply: segment+crawldb > warc files > search engine?

          Nope, let's start again.
          We mentioned in this issue that we'd like to make the indexing backends pluggable in order to simplify the code and make it easier for others to implement alternative backends. We currently have only SOLR, ES is clearly a good candidate, and you've rightly pointed out that we could have an XML dump of the docs. I would add that we could plug in JDBC or HBase etc... WARC is just another example of something we could have as a plugin.

          The question was: is there a functional difference between say [XML|WARC] and [SOLR|ES]? For instance the plugin endpoint for SOLR|ES would need to handle deletions, but not the XML or WARC one. Are there any more such differences? Is it an index vs dump issue? A remote vs local one? Would it make sense to have on one hand an indexer with plugins supporting deletions and expecting a URL, and on the other a separate job for converting segments and crawldb to XML, WARC etc...?

          Does it make more sense?

          markus17 Markus Jelsma added a comment -

          Ah yes it makes sense now!

          If you look at the patch for NUTCH-1139 you can see that the endpoint, Solr in this case, implements the delete method as called from NutchIndexAction. Another endpoint could simply ignore it and do nothing but write out WARC or Solr XML files.
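The server-vs-file distinction being discussed could look roughly like this. None of these are real Nutch classes; it's a sketch of the idea that a Solr/ES-style writer honours delete(), while a file-dump writer (WARC or Solr XML style) implements it as a no-op because its output is append-only.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: two hypothetical endpoint implementations.
interface DocWriter {
    void write(String doc);
    void delete(String id);
}

// A server-style writer (think Solr/ES): deletions are honoured.
class ServerWriter implements DocWriter {
    final Set<String> index = new HashSet<>();
    public void write(String doc) { index.add(doc); }
    public void delete(String id) { index.remove(id); }
}

// A file-dump writer (think WARC/XML): output is append-only,
// so delete() is simply a no-op.
class FileDumpWriter implements DocWriter {
    final StringBuilder out = new StringBuilder();
    public void write(String doc) { out.append(doc).append('\n'); }
    public void delete(String id) { /* no-op: append-only output */ }
}
```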

          jnioche Julien Nioche added a comment -

          The class NutchIndexWriter and NutchIndexWriterFactory already provide us with the type of abstraction we need. We could turn the interface NutchIndexWriter into an endpoint and add the methods we need (e.g. delete). What is not clear yet is what IndexerOutputFormat is used for and whether we will be able to use implementations of NutchIndexWriter from within a plugin.

          markus17 Markus Jelsma added a comment -

          20120304-push-1.6

          ferdy.g Ferdy Galema added a comment -

          Changing NutchIndexWriter into an endpoint looks like the best solution to have a pluggable indexing backend.

          > "What is not clear yet is what IndexerOutputFormat is used for"
          More or less what it is used for now? (A bridge for mapreduce code to write documents to indexwriters.) What I've changed in Nutch2.x is that IndexerOutputFormat does not extend from FileOutputFormat anymore. (Many indexers do not use the filesystem at all, and the temporary files that were written anyway are unnecessary.) When there is a file-based implementation again (like the above-mentioned XML output indexer), it is always possible to introduce an abstract indexwriter that is used as a base for backends that use the filesystem, i.e. FileIndexWriter or something like that. Open for discussion.
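The abstract file-based base class suggested here could be sketched as follows. FileIndexWriter is the name proposed in the comment; everything else (serialize, XmlIndexWriter, the buffer) is made up for illustration. Subclasses only decide how a document is serialised; the base class accumulates the output.

```java
import java.util.Map;

// Sketch of an abstract base for file-based index writers, assuming
// the FileIndexWriter idea from the comment. Names are hypothetical.
abstract class FileIndexWriter {
    protected final StringBuilder buffer = new StringBuilder();

    // Concrete writers decide the serialisation (XML, WARC, ...).
    abstract String serialize(Map<String, String> doc);

    void write(Map<String, String> doc) {
        buffer.append(serialize(doc)).append('\n');
    }

    String contents() { return buffer.toString(); }
}

// A minimal XML-dump writer built on the base class.
class XmlIndexWriter extends FileIndexWriter {
    String serialize(Map<String, String> doc) {
        return "<doc id=\"" + doc.get("id") + "\"/>";
    }
}
```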

          One thing I noticed is that Nutch trunk still uses the old mapreduce API. (Note NUTCH-1219). It is not really a blocker, but since Nutchgora is using the new API, it will cause some differences in implementation for trunk and Nutch2. For now I think it would be okay to ignore Nutch2 and make an implementation for trunk first. (I'm happy to make a port to Nutch2 afterwards).

          > "whether we will be able to use implementations of NutchIndexWriter from within a plugin"
          What do you mean with this?

          ferdy.g Ferdy Galema added a comment -

          I did not mean to confuse people by using Nutchgora and Nutch2 in the same context. Of course they are just the same thing.

          jnioche Julien Nioche added a comment -

          Thanks for your comments, Ferdy.

          What I've changed in Nutch2.x is that IndexerOutputFormat does not extend from FileOutputFormat anymore.

          would be good to do the same for 1.x

          "whether we will be able to use implementations of NutchIndexWriter from within a plugin"

          What do you mean with this?

          I meant that we need to check whether we can have the NutchIndexWriter implementations available in a plugin, which would be nice as we'd have our generic commands + the indexing endpoints implementations in their respective plugins (e.g. indexer-SOLR, indexer-ES) etc...

          ferdy.g Ferdy Galema added a comment -

          Ah yes, I think that is what we should aim for. This works well with how users most often add functionality: simply copy an existing plugin and change it to suit their custom needs.

          jnioche Julien Nioche added a comment -

          This is work in progress.
          This patch creates a new endpoint (IndexWriter) that plugins can implement. It comes with one such plugin (indexer-solr) and generic code for replacing the index and delete jobs. I haven't tested it very much. The main difference is that the SOLR URL must be passed as a Hadoop param, e.g. -D solr.server.url. It could also be put in the nutch-site.xml once and for all.
          There will be some cleaning to do once this is stable to remove the SOLR stuff in the core code etc...
          Please have a look and let me know your thoughts on this.
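As the comment notes, the SOLR URL could also be set once in nutch-site.xml rather than passed on every run. A minimal sketch (the property name solr.server.url comes from this thread; the value is a placeholder):

```xml
<!-- nutch-site.xml: set the Solr URL once instead of passing -D on every run -->
<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/</value>
</property>
```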

          lewismc Lewis John McGibbney added a comment -

          Nice one Julien

          jnioche Julien Nioche added a comment -

          New version of the patch, which removes all SOLR related stuff from the core.
          The Crawl class assumes that SOLR is used (but this can be changed) and does not do the SOLR dedup anymore. We'll need a better mechanism for the dedup, as the existing one is SOLR-centric and not very scalable.
          It's quite a drastic modification of the code, but should be for the best.
          Please give it a try and let me know your thoughts.
          PS: you might need to delete the index.solr package by hand.

          jnioche Julien Nioche added a comment -

          Cleaner version of the patch, which removes the content from the solr package, adds the dependencies to the indexer-solr plugin in the plugin.xml definition, and changes the nutch script so that the SOLR related commands work in the same way but use the plugin under the bonnet. A few more things to do, e.g. management of the commits when indexing, but we are getting there.

          markus17 Markus Jelsma added a comment -

          Very nice Julien! Can you also add update() to the writer interface? See NUTCH-1506. Some impls can do this, such as recent Solr commits. Other impls can defer to add() if applicable or throw UnsupportedOperationException.
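The suggestion could look roughly like this (hypothetical names, not the real NUTCH-1506 interface): update() delegates to add() by default, so a backend without native update support still works, while backends that can update in place override it or throw UnsupportedOperationException instead.

```java
import java.util.Map;

// Sketch of the update() suggestion above; all names are illustrative.
interface Writer {
    void add(Map<String, String> doc);

    // Default: fall back to a plain add when in-place updates
    // aren't supported by the backend.
    default void update(Map<String, String> doc) {
        add(doc);
    }
}

// A trivial implementation used only to show the default in action.
class CountingWriter implements Writer {
    int adds = 0;
    public void add(Map<String, String> doc) { adds++; }
}
```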

          jnioche Julien Nioche added a comment -

          Good point Markus, thanks.
          The main issue I am struggling with at the moment is what to do with the SOLR deduplication. I don't think we can run a MapReduce job from a plugin, so it's not going to work. One (temporary) option would be to leave it as is, so that the crawl command, the crawl script and the nutch command all work as expected, and then get rid of it when we have a generic deduplication job.

          markus17 Markus Jelsma added a comment -

          I had an issue with dedup too in NUTCH-1480; unless we do something about it I cannot commit that. Personally I'd prefer to never touch that class again but keep it as legacy. What do you think?

          jnioche Julien Nioche added a comment -

          We definitely need a better mechanism for deduplication. +1 to leave it as is for now until we have a better option. Slightly annoying for this issue is that it means adding it back to the main classes, as well as SOLR as a dependency, but not a big deal.

          markus17 Markus Jelsma added a comment -

          Alright, I'll skip dedup for NUTCH-1480, see if I can send it in, and work on NUTCH-1377.

          Are you sure you cannot run a MapReduce program from within a plugin? I think it's worth trying.

          jnioche Julien Nioche added a comment -

          Tried, failed.
          Re: other issues, wouldn't it make sense to do NUTCH-1047 first before you improve the SOLR backends?

          markus17 Markus Jelsma added a comment -

          Too bad.

          I'm not sure; at least 1480 is ready, but fine by me. Too bad I'll have to rewrite the patches then.

          jnioche Julien Nioche added a comment -

          It should not be a big deal, as the classes affected by NUTCH-1480 are not modified that much by NUTCH-1047, and it also means that you'll get to look at the code for this issue, which is a good way of reviewing it.

          markus17 Markus Jelsma added a comment -

          which is a good way of reviewing it

          Cheers! Looking forward to your new patch.

          jnioche Julien Nioche added a comment -

          My suggestion was that you give NUTCH-1047 a try, wait until it is committed, then commit your changes on top of it; not that I'd patch it to include your changes.

          BTW I have commented on NUTCH-1480.

          thanks

          Julien

          markus17 Markus Jelsma added a comment -

          No, I understood correctly.

          jnioche Julien Nioche added a comment -

          First working patch!
          Added the SOLRDedup back into the core classes, as it does not seem to be possible to run a MapReduce class from within a plugin.
          Added 2 new methods to the IndexWriter interface (commit, update) + fixed CleaningJob and the nutch script.
          Tried it on a small crawl with the crawl script and it worked as expected.

          markus17 Markus Jelsma added a comment -

          Excellent work my friend! I'll be sure to test this next week! Hopefully it all works out fine and I can rewrite the other indexing patches with ease.

          Cheers!

          amuseme.lu lufeng added a comment -

          Hi, I applied the patch, but I could not find how to set the Solr URI, and the class SolrUtils is duplicated in two places. Maybe later the DeleteDuplicates will be made pluggable in the backends too.

          jnioche Julien Nioche added a comment -

          Hi Lufeng.

          The solrindex command in the nutch script works just as before. You can also invoke the IndexingJob command and pass it the SOLR URL as a Hadoop parameter e.g. -D solr.server.url=xxxxxx

          SolrUtils is duplicated indeed because of DeleteDuplicates, which is a SOLR-specific implementation. We need to build a generic deduplicator at some point and it will use the pluggable backends. I decided to leave the SOLR-based one in for now, but if most people don't use it then we should probably shelve it. This is a separate issue though.

          Thanks for your comments

          lewismc Lewis John McGibbney added a comment -

          Hi Julien, it will be early next week until I can try this patch out. There are numerous hurdles to get over regarding the network security here and I do not quite know the configuration yet. It's top of my Jira TODO though.

          tejasp Tejas Patil added a comment -

          Hi Julien,
          I am trying out the patch and facing an issue. Maybe I am using it the wrong way. Here is what I did:
          After setting up nuch+solr and changing schema.xml as per wiki, I applied the patch. If I dont pass the -D option in crawl command, it throws an exception indicating "Missing SOLR URL". I believe that -solr option along with the url also needs to be provided else it wont perform the indexing part. To run a test crawl, I use this command:

          bin/nutch crawl -D solr.server.url=http://localhost:8983/solr/ urls  -solr http://localhost:8983/solr/  -depth 5 -topN 5000

          It gives me an exception saying: "ERROR: [doc=http://searchhub.org/2009/03/09/nutch-solr/] unknown field 'content'" . I have no clue about this. Can you kindly point out where I went wrong ?

          Also, the crawl command above needs the solr url to be specified twice. Is there a way to run it with the solr url being specified just once ?

          amuseme.lu lufeng added a comment -

          Hi Tejas,

          Maybe you didn't add the -D option to the bin/nutch crawl command; both are used to set the solr.server.url parameter. And the cause of the "unknown field 'content'" error is probably that the Solr schema.xml is not configured correctly. Did you copy the conf/schema.xml from the Nutch conf directory to the example/solr/conf directory?

          amuseme.lu lufeng added a comment -

          Hi Julien,

          I found that in bin/nutch there is a line like this: CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1". But I don't know why we don't add an option to set the indexer URL, such as bin/nutch solrindex -indexurl http://localhost:8983/solr/.

          But now I have found that the correct command to invoke the IndexingJob is "bin/nutch solrindex http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter".

          tejasp Tejas Patil added a comment -

          Hi Lufeng,

          You are right. There was a problem with my schema.xml file. I corrected it and now things are working. Thanks !!

          jnioche Julien Nioche added a comment -

@tejasp I can reproduce the issue and am looking into it. Somehow the configuration does not get passed on properly when using the crawl command. Thanks.

          Lufeng

          But i don't know why not add an option to set IndexerUrl such as bin/nutch solrindex -indexurl http://localhost:8983/solr/.

Whether it is passed as a parameter or via configuration should not make much of a difference. Your suggestion also assumes that the indexing backend can be reached via a single URL, which is not necessarily the case: it might not need a URL at all or, conversely, might need several. Better to leave that logic in the configuration and assume that the backends will find whatever they need there.

          the corrent command to invoke the IndexingJob command is "bin/nutch solrindex http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter".

As explained above, we want to keep compatibility with the existing solrindex command and not change its syntax. Underneath it uses the new plugin-based code but sets the value of the SOLR config. There is no shortcut for the generic indexing job command in the nutch script yet, but we could add one. For now it has to be called in full, e.g. bin/nutch org.apache.nutch.indexer.IndexingJob ..., which will make sense once we have other indexing backends and not just SOLR.

          Think about 'nutch solrindex' as a shortcut for the generic command.
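To make the discussion concrete, here is a hypothetical, heavily simplified sketch of what a pluggable index-writer extension point could look like. All class and method names here (Doc, IndexWriter, InMemoryIndexWriter, open/write/delete/close) are illustrative stand-ins, not the actual API from the patch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal stand-in for a Nutch document: a URL plus named fields.
class Doc {
    final String url;
    final Map<String, String> fields = new HashMap<>();
    Doc(String url) { this.url = url; }
}

// The extension point: each backend (SOLR, ElasticSearch, CSV, ...)
// would implement this and be selected via plugin configuration.
interface IndexWriter {
    void open(Map<String, String> conf); // read backend settings from config
    void write(Doc doc);
    void delete(String url);
    void close();
}

// Example backend: buffers documents in memory. A real plugin would
// send them to SOLR, ElasticSearch, a CSV file, CouchDB, etc.
class InMemoryIndexWriter implements IndexWriter {
    final List<Doc> buffer = new ArrayList<>();
    public void open(Map<String, String> conf) { /* nothing to do */ }
    public void write(Doc doc) { buffer.add(doc); }
    public void delete(String url) { buffer.removeIf(d -> d.url.equals(url)); }
    public void close() { /* a real backend would flush/commit here */ }
}

public class PluggableIndexingSketch {
    public static void main(String[] args) {
        // The generic indexing job only talks to the interface,
        // so swapping backends requires no change to the job code.
        InMemoryIndexWriter writer = new InMemoryIndexWriter();
        writer.open(new HashMap<>());
        Doc d = new Doc("http://example.com/");
        d.fields.put("title", "Example");
        writer.write(d);
        System.out.println("buffered=" + writer.buffer.size());
        writer.close();
    }
}
```

The point of the sketch is the design choice being discussed: the generic job depends only on the interface, while each backend reads whatever settings it needs from the configuration rather than from positional command-line arguments.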

          tejasp Tejas Patil added a comment -

          Hi Julien,
After @lufeng's reply, I was able to perform indexing with the crawl command. Here is a summary of what I have observed:

          "solr.server.url" in nutch-site.xml "-D" in crawl command Works ?
          no no RuntimeException: Missing SOLR URL
          no yes yes
          yes no yes
          yes yes yes

Note that I had to pass "-solr" and the Solr URL every time; otherwise indexing was not invoked.

          jnioche Julien Nioche added a comment -

          Hi Tejas

It will work every time you set it in nutch-site.xml. As for setting it with -D in the crawl command: you definitely should not have to do that, and this is where the bug is. The problem is that the value taken from the crawl command is correctly set in the configuration object, but for some reason the latter is reloaded or overridden during the call to JobClient.runJob(job) (IndexingJob line 120).

          BTW the crawl command is deprecated and should be removed at some point as we have the crawl script. Could you try using the SOLRIndex command as well as the crawl script while I try and solve the problem with the crawl command?

          Thanks

          Julien

          tejasp Tejas Patil added a comment -

Hi Julien, The solrindex command and the crawl script work fine after setting "solr.server.url" in nutch-site.xml. I did not use the "-D" option during these runs.

          jnioche Julien Nioche added a comment -

          Tejas

The crawl script and the solrindex command should work without setting "solr.server.url" in nutch-site.xml or using -D, as this is handled for you in the nutch script. Can you please test without specifying "solr.server.url" in nutch-site.xml?

          Thanks

          wastl-nagel Sebastian Nagel added a comment -

As a test for the interface I started to implement a CSV indexer - useful for exporting crawled data or for quick analysis. The first working version (a draft, still a lot to do) took just over 100 lines of code: +1 for the interface / extension point.

          Some concerns about the usability of IndexingJob as a "daily" tool:

• it's not really transparent which indexer is run (solr, elastic, etc.): you have to look into the plugin.includes property
• options must be passed to indexer plugins as properties: this is complicated, and there is no help listing the available properties
          tejasp Tejas Patil added a comment -

          Hi Julien,

          As you suggested, I tried to run solrindex command without setting "solr.server.url" in nutch-site.xml or "-D".

          Command used:

          bin/nutch solrindex http://localhost:8983/solr mycrawl/crawldb/ mycrawl/segments/201301280439/

          It says:

          Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]

          The check for number of args is causing this. I corrected it locally and it worked fine after that.
As per the usage above, the user needs to provide just the crawldb and segment. But the user must also pass the Solr URL, which is consumed by the bin/nutch script. The usage message must be changed to hide this mechanism from the user.

          jnioche Julien Nioche added a comment -

Sebastian Nagel a text-based indexer is a good idea. Having one that generates data in the format used by CloudSearch (see NUTCH-1517) would be cool as well. As for your concerns: most people currently use the SOLR indexer, which will still be the one activated by default. I expect a minority of people will try to use something else, and if they do then checking which one is activated is no big deal, either via the config file or from the logs. Passing the options via the config with -D is not very different from using a standard parameter, with the added benefit that it gives us the possibility to set things in nutch-site.xml once and for all and hence make the commands much simpler. As for the list of properties, they would vary from backend to backend anyway. Each plugin could have a README describing what its options are; compared to having everything in nutch-default.xml, at least the descriptions will be contained within the related plugin.
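For reference, setting the property once in nutch-site.xml avoids repeating it on every command. The snippet below uses the standard Nutch property format; the URL value is just an example:

```xml
<!-- nutch-site.xml: set once, picked up by the indexing commands -->
<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/</value>
  <description>URL of the SOLR server used by the SOLR indexer plugin.</description>
</property>
```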

Tejas Patil good catch on the number of args, I will fix it. Re the usage message: we could add a getUsage() method to each backend that the generic command will call for all the active indexing plugins. I think the solrindex shortcut is just a temporary measure, though, until the documentation is up to scratch and the user base has got used to the generic commands.

          Thanks for taking the time to share your thoughts, guys.

          jnioche Julien Nioche added a comment -

• Fixed the bug with the check on the number of arguments for the index command.
• Fixed the issue with the solr param not being passed on when using the all-in-one crawl command.
• Added a describe() method to IndexWriter, which is called by the IndexingJob and dumps to the log a list of all the active index writers as well as the parameters they take.

All the issues mentioned previously should now be fixed. Basically, the crawl and solrindex commands should work in exactly the same way as before, so there is no change from a user point of view, but we also gain the possibility of plugging in new backends.

          Please give it a try, would be nice to commit that soon.
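The describe() idea mentioned above could look something like the following minimal sketch. The class names, the property list, and the output layout are illustrative only and may differ from what the patch actually commits; solr.server.url and solr.commit.size are shown as examples of the kind of parameters a backend would report:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: each active index writer reports its name and
// the configuration properties it understands, for use in usage/help output.
interface DescribableWriter {
    String describe();
}

class SolrWriterInfo implements DescribableWriter {
    public String describe() {
        // Illustrative parameter list; a real backend would list its own.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("solr.server.url", "URL of the SOLR instance (mandatory)");
        params.put("solr.commit.size", "buffer size when sending to SOLR");
        StringBuilder sb = new StringBuilder("SOLRIndexWriter\n");
        for (Map.Entry<String, String> e : params.entrySet())
            sb.append('\t').append(e.getKey()).append(" : ")
              .append(e.getValue()).append('\n');
        return sb.toString();
    }
}

public class DescribeSketch {
    public static void main(String[] args) {
        // The generic job would loop over all active writers and
        // print each description as part of the usage message.
        System.out.print(new SolrWriterInfo().describe());
    }
}
```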

          amuseme.lu lufeng added a comment -

Patch v5 works correctly with Nutch 1.6 and Solr 3.6 and 4.1. Note that the Solr 4.1 configuration file schema-solr4.xml needs the patch from NUTCH-1486.

It would be better if the indexer could report progress.

Good job, thanks Julien.

          tejasp Tejas Patil added a comment -

          Hi Julien,

The crawl command (with the solr option) and the solrindex command are working properly now. Is there anything else that you think should be verified?

          jnioche Julien Nioche added a comment -

          Hi Tejas

Thank you for taking the time to have a look. The SolrClean command has also been modified to use the plugin architecture; that should be the last thing, I think.

          Thanks

          Julien

          tejasp Tejas Patil added a comment -

          Hey Julien,

While running the solrclean command, I followed the old usage given here [0]. It gave an exception. Then I checked the usage message, which gave:

          $ bin/nutch solrclean 
          Usage: CleaningJob <crawldb> [-noCommit]

That did not work either. It just prints the usage if only the crawldb is passed as an argument. I went through the patch and realized that the bin/nutch script treats the first argument as the Solr URL and then passes the remainder, i.e. the crawldb, to the Java code. This is what worked for me:

          bin/nutch solrclean <solrurl> <crawldb>

This is different from the old usage given at [0]. We can avoid changing the order of the arguments and preserve the old usage. This can be used in the bin/nutch script:

          CLASS="org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2"

and not perform a "shift" after that. The corresponding usage message must be modified in the Java code too.

          [0] : http://wiki.apache.org/nutch/bin/nutch%20solrclean

          tejasp Tejas Patil added a comment -

Hey Julien, one question: why does this change not affect the "solrdedup" command (i.e. the SolrDeleteDuplicates class)?

          jnioche Julien Nioche added a comment -

          Hi Tejas

          Good catch, could do


          CLASS="org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2 $1"
          shift; shift

There is no change needed in the Java code, as it expects only one argument, which is the crawldb. We could also get the CleaningJob to log which indexers are available.

Re solrdedup: the explanation was given earlier in this thread. It is a SOLR-specific approach, and we can't run a job located in a plugin; the main job file has to be in the core code. We need a better deduplicator anyway.

          tejasp Tejas Patil added a comment -

          Hi Julien,

One small change in the Java class would be to display this usage message to the user:

          $ bin/nutch solrclean 
          Usage: CleaningJob <crawldb> <solrurl> [-noCommit]

The current patch doesn't display "solrurl" in the usage.

          jnioche Julien Nioche added a comment -

          Tejas,

The CleaningJob is backend-neutral and as such should not expect <solrurl> as a parameter, same as with the IndexingJob really.

          wastl-nagel Sebastian Nagel added a comment -

          Hi Julien,

Overall, all looks good. A first version of the CSV indexer is ready (NUTCH-1541) and works well with the last v5 patch.

One point we should improve is the command-line help. I agree with Tejas that the help should list all required arguments. Of course, you are right that the index/cleaning jobs are "backend-neutral", but then it would be preferable to have new commands "index" and "indexclean". These are also required if other indexer back-ends are used. We can keep the "solr*" commands for legacy reasons and because they are handy. A few additional lines to generate the old help text are tolerable and could avoid unnecessary user requests on the mailing list.

          The describe() method is a good idea. The new commands will then show sufficient help but IndexingJob/CleaningJob should also call describe() when help is shown!

          Some trivialities to get the Java docs right:

          • default.properties - need to add the new "plugins.indexer" group with indexer-solr as member
          • build.xml - add group referring to "plugins.indexer", add Java doc targets for indexer-solr
          jnioche Julien Nioche added a comment -

          Final patch for the records before committing. Have added generic 'index' and 'clean' commands, which call the describe() method of the IndexWriters as part of the usage message.
Have added the Javadoc as suggested by Seb, as well as the fix for SOLRClean from Tejas.

          jnioche Julien Nioche added a comment -

          Committed revision 1453776.

Thanks everyone for the comments and reviews. Let's add some indexing backends now.

          hudson Hudson added a comment -

          Integrated in Nutch-trunk-Windows #57 (See https://builds.apache.org/job/Nutch-trunk-Windows/57/)
          NUTCH-1047 Pluggable indexing backends (Revision 1453776)

          Result = FAILURE
          jnioche : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1453776
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/build.xml
          • /nutch/trunk/conf/nutch-default.xml
          • /nutch/trunk/default.properties
          • /nutch/trunk/ivy/ivy.xml
          • /nutch/trunk/src/bin/nutch
          • /nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/CleaningJob.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriter.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriters.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrClean.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
          • /nutch/trunk/src/plugin/build.xml
          • /nutch/trunk/src/plugin/indexer-solr
          • /nutch/trunk/src/plugin/indexer-solr/build.xml
          • /nutch/trunk/src/plugin/indexer-solr/ivy.xml
          • /nutch/trunk/src/plugin/indexer-solr/plugin.xml
          • /nutch/trunk/src/plugin/indexer-solr/src
          • /nutch/trunk/src/plugin/indexer-solr/src/java
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
          • /nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
          hudson Hudson added a comment -

          Integrated in Nutch-trunk #2144 (See https://builds.apache.org/job/Nutch-trunk/2144/)
          NUTCH-1047 Pluggable indexing backends (Revision 1453776)

          Result = SUCCESS
          jnioche : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1453776
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/build.xml
          • /nutch/trunk/conf/nutch-default.xml
          • /nutch/trunk/default.properties
          • /nutch/trunk/ivy/ivy.xml
          • /nutch/trunk/src/bin/nutch
          • /nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/CleaningJob.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriter.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriters.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrClean.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
          • /nutch/trunk/src/plugin/build.xml
          • /nutch/trunk/src/plugin/indexer-solr
          • /nutch/trunk/src/plugin/indexer-solr/build.xml
          • /nutch/trunk/src/plugin/indexer-solr/ivy.xml
          • /nutch/trunk/src/plugin/indexer-solr/plugin.xml
          • /nutch/trunk/src/plugin/indexer-solr/src
          • /nutch/trunk/src/plugin/indexer-solr/src/java
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
          • /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
          • /nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
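The commit above introduces an IndexWriter extension point and an IndexWriters dispatcher, with the Solr writer moved into an indexer-solr plugin. The fan-out pattern behind it can be sketched roughly as below; the interface, class names, and method signatures here are simplified illustrations for this issue, not the actual Nutch API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal shape of a pluggable index-writer endpoint.
interface IndexWriter {
    void write(Map<String, String> doc);
    void delete(String key);
    void commit();
}

// An in-memory backend standing in for Solr, ElasticSearch, CouchDB, etc.
class InMemoryIndexWriter implements IndexWriter {
    final Map<String, Map<String, String>> index = new HashMap<>();
    final Map<String, Map<String, String>> pending = new HashMap<>();

    public void write(Map<String, String> doc) {
        pending.put(doc.get("id"), doc);
    }
    public void delete(String key) {
        pending.remove(key);
        index.remove(key);
    }
    public void commit() {
        index.putAll(pending);
        pending.clear();
    }
}

// Dispatches every call to all activated backends, the way a dispatcher
// class could fan out indexing, cleaning and deduplicating to each
// indexer-* plugin without the jobs knowing which backends are in use.
class IndexWriters implements IndexWriter {
    final List<IndexWriter> writers = new ArrayList<>();
    IndexWriters(List<IndexWriter> backends) { writers.addAll(backends); }
    public void write(Map<String, String> doc) { for (IndexWriter w : writers) w.write(doc); }
    public void delete(String key)             { for (IndexWriter w : writers) w.delete(key); }
    public void commit()                        { for (IndexWriter w : writers) w.commit(); }
}
```

With this shape, a generic indexing job only ever talks to the dispatcher, and adding a new backend means implementing the interface in a new plugin rather than touching the job code.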
          lewismc Lewis John McGibbney added a comment -

          Nice work Julien.


            People

            • Assignee:
              jnioche Julien Nioche
            • Reporter:
              jnioche Julien Nioche
            • Votes:
              3
            • Watchers:
              10