Details
Description
One possible feature would be to add a new endpoint for indexing backends and make the indexing pluggable. At the moment we are hardwired to SOLR, which is OK, but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins. Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating the fields sent to the backends), and moreover the backends are not necessarily for indexing / searching but could be just an external storage, e.g. CouchDB. The term backend on its own would be confusing in 2.0, as it could be taken to refer to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones.
We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.
Attachments
- NUTCH-1047-1.x-v5.patch (86 kB) - Julien Nioche
- NUTCH-1047-1.x-v4.patch (85 kB) - Julien Nioche
- NUTCH-1047-1.x-v3.patch (121 kB) - Julien Nioche
- NUTCH-1047-1.x-v2.patch (68 kB) - Julien Nioche
- NUTCH-1047-1.x-v1.patch (80 kB) - Julien Nioche
- NUTCH-1047-1.x-final.patch (90 kB) - Julien Nioche
Issue Links
- blocks
  - NUTCH-1527 Port nutch-elasticsearch-indexer to Nutch (Closed)
  - NUTCH-1528 Port nutch-mongodb-indexer to Nutch (Closed)
- is depended upon by
  - NUTCH-1088 Write Solr XML documents (Open)
  - NUTCH-1517 CloudSearch indexer (Closed)
- is related to
  - NUTCH-1446 Port NUTCH-1444 to trunk (Indexing should not create temporary files) (Open)
  - NUTCH-1139 Indexer to delete documents (Closed)
  - NUTCH-656 DeleteDuplicates based on crawlDB only (Closed)
- relates to
  - NUTCH-1568 port pluggable indexing architecture to 2.x (Closed)
Activity
My interest here is in your last point, which raises a question that I suppose is wide open to discussion. What end-points (generally speaking) are we going to support and formally represent as pluggable entities? What criteria do we make decisions based on?
We'll simply port the existing SOLR indexing to the plugin-based architecture so that people can easily add the backends they need. If there is widespread need for a specific backend then I suppose someone will contribute patches and it might get committed. It's not as if we need to define which backends (not the same as endpoints, BTW) would be added; we are just giving people the possibility of adding theirs without having to do a dirty hack of the indexer.
There is currently a growing interest in ElasticSearch, and I know of at least one person who has modified the SOLR indexer to get it to work for ES. This would be a good candidate for inclusion; apart from that, let's see what people contribute.
It would be nice to have a plugin implementing this endpoint to generate WARC files. There seem to be two different situations though: one where we send docs to servers (SOLR, ES) and one where we generate files. Do we need to handle deletions for the latter? I don't think so, but we would need to for the former.
Any thoughts on this? Would it make sense to have 2 different endpoints or not?
Hi Julien,
I'm not sure I get your point exactly, but if we don't generate WARC files we:
- don't have to think about the problem you state
- don't create an additional process between Nutch and a search engine
If you'd need WARC files, for some reason, I'd rather have an endpoint for it just like for ES and Solr instead of using WARC files as an intermediate format.
Does your suggestion imply: segment+crawldb > warc files > search engine?
> If you'd need WARC files, for some reason, I'd rather have an endpoint for it just like for ES and Solr instead of using WARC files as an intermediate format.
> Does your suggestion imply: segment+crawldb > warc files > search engine?
Nope, let's start again.
We mentioned in this issue that we'd like to make the indexing backends pluggable in order to simplify the code and make it easier for others to implement alternative backends. We currently have only SOLR; ES is clearly a good candidate, and you've rightly pointed out that we could have an XML dump of the docs. I would add that we could plug in JDBC or HBase etc. WARC is just another example of something we could have as a plugin.
The question was: is there a functional difference between, say, [XML|WARC] and [SOLR|ES]? For instance the plugin endpoint for SOLR|ES would need to handle deletions, but not the XML or WARC one. Are there any more such differences? Is it an index vs dump issue? A remote vs local one? Would it make sense to have on one hand an indexer with plugins supporting deletions and expecting a URL, and on the other a separate job for converting segments and crawldb to XML, WARC, etc.?
Does it make more sense?
Ah yes it makes sense now!
If you look at the patch for NUTCH-1139 you can see that the endpoint, Solr in this case, implements the delete method as called from NutchIndexAction. Another endpoint could simply ignore it and do nothing but write out WARC or Solr XML files.
The class NutchIndexWriter and NutchIndexWriterFactory already provide us with the type of abstraction we need. We could turn the interface NutchIndexWriter into an endpoint and add the methods we need (e.g. delete). What is not clear yet is what IndexerOutputFormat is used for and whether we will be able to use implementations of NutchIndexWriter from within a plugin.
Changing NutchIndexWriter into an endpoint looks like the best solution to have a pluggable indexing backend.
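For illustration, a minimal Java sketch of the kind of extension point being discussed; the interface and method names below are assumptions based on this thread, not the committed API.

```java
import java.io.IOException;
import org.apache.nutch.indexer.NutchDocument;

// Hypothetical sketch of an indexing-backend extension point.
// A remote backend (SOLR, ES) would implement delete() for real,
// while a file-based writer (WARC, Solr XML dump) could leave it a no-op.
public interface IndexWriterSketch {
  void write(NutchDocument doc) throws IOException;  // send a document to the backend
  void delete(String key) throws IOException;        // remove a document by its key
  void close() throws IOException;                   // flush and release resources
}
```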
> "What is not clear yet is what IndexerOutputFormat is used for"
More or less what it is used for now? (A bridge for mapreduce code to write documents to index writers.) What I've changed in Nutch2.x is that IndexerOutputFormat does not extend from FileOutputFormat anymore (since many indexers do not use the filesystem at all, and the temporary files that were written anyway are unnecessary). When there is a file-based implementation again (like the above-mentioned XML output indexer) it is always possible to introduce an abstract index writer that is used as a base for backends that use the filesystem, i.e. FileIndexWriter or something like that. Open for discussion.
One thing I noticed is that Nutch trunk still uses the old mapreduce API. (Note NUTCH-1219). It is not really a blocker, but since Nutchgora is using the new API, it will cause some differences in implementation for trunk and Nutch2. For now I think it would be okay to ignore Nutch2 and make an implementation for trunk first. (I'm happy to make a port to Nutch2 afterwards).
> "whether we will be able to use implementations of NutchIndexWriter from within a plugin"
What do you mean with this?
I did not mean to confuse people by using Nutchgora and Nutch2 in the same context. Of course they are just the same thing
Thanks for your comments Ferdy
> What I've changed in Nutch2.x is that IndexerOutputFormat does not extend from FileOutputFormat anymore.
would be good to do the same for 1.x
"whether we will be able to use implementations of NutchIndexWriter from within a plugin"
What do you mean with this?
I meant that we need to check whether we can have the NutchIndexWriter implementations available in a plugin, which would be nice as we'd have our generic commands + the indexing endpoints implementations in their respective plugins (e.g. indexer-SOLR, indexer-ES) etc...
Ah yes, I think that is what we should aim for. This works well with how users most often add functionality: simply copy an existing plugin and change it to suit their custom needs.
This is work in progress.
This patch creates a new endpoint (IndexWriter) that plugins can implement. Comes with one such plugin (indexer-solr) and generic code for replacing the index and delete jobs. Haven't tested very much. The main difference is that the SOLR URL must be passed as a Hadoop param e.g. -D solr.server.url. It could also be put in the nutch-site.xml once and for all.
There will be some cleaning to do once this is stable to remove the SOLR stuff in the core code etc...
Please have a look and let me know your thoughts on this
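As a rough illustration of the configuration mechanism described above (not the code from the patch), a writer can read the URL from the Hadoop Configuration regardless of whether it was set with -D on the command line or in nutch-site.xml; the class and method names below are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper showing how a backend plugin could resolve its URL
// from the job configuration, so that "-D solr.server.url=..." and a
// nutch-site.xml entry are handled the same way.
public class BackendUrlExample {
  public static String getSolrUrl(Configuration conf) {
    String url = conf.get("solr.server.url");
    if (url == null || url.isEmpty()) {
      // fail fast if the backend has not been configured at all
      throw new RuntimeException("Missing SOLR URL");
    }
    return url;
  }
}
```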
New version of the patch which removes all SOLR-related stuff from the core.
The crawl class assumes that Solr is used (but this can be changed) and does not do the SOLR dedup anymore. We'll need a better mechanism for the dedup as the existing one is SOLR-centric and not very scalable.
Quite a drastic modification of the code, but should be for the best.
Please give it a try and let me know your thoughts.
PS: you might need to delete the index.solr package by hand
Cleaner version of the patch which removes the content from the solr package, adds the dependencies to the indexer-solr plugin in the plugin.xml definition, and changes the nutch script so that the SOLR-related commands work in the same way but use the plugin under the bonnet. A few more things to do, e.g. management of the commits when indexing, but we are getting there.
Very nice Julien! Can you also add update() to the writer interface? See NUTCH-1506. Some impls can do this, such as recent Solr commits. Other impls can defer to add() if applicable, or throw UnsupportedOperationException.
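A minimal sketch of the two options described here; the classes below are hypothetical and only illustrate deferring update() to the regular add/write path or rejecting it.

```java
import java.io.IOException;
import org.apache.nutch.indexer.NutchDocument;

// Hypothetical writer without native update support: it can simply
// treat an update as a re-add ...
public class AddOnlyWriterExample {
  public void write(NutchDocument doc) throws IOException {
    // ... send the document to the backend here ...
  }

  public void update(NutchDocument doc) throws IOException {
    write(doc); // defer to the add/write path
  }
}

// ... or refuse the operation outright.
class NoUpdateWriterExample {
  public void update(NutchDocument doc) {
    throw new UnsupportedOperationException("update not supported by this backend");
  }
}
```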
Good point Markus, thanks.
The main issue I am struggling with at the moment is what to do with the SOLR deduplication. I don't think we can run a MapReduce job from a plugin, so it's not going to work. One (temporary) option would be to leave it as is, so that the crawl command, the crawl script and the nutch command keep working as expected, and then get rid of it once we have a generic deduplication job.
I had an issue with dedup too in NUTCH-1480; unless we do something about it I cannot commit that. Personally I'd prefer to never touch that class again but keep it as legacy. What do you think?
We definitely need a better mechanism for deduplication. +1 to leave it as is for now until we have a better option. Slightly annoying for this issue is that it means adding it back to the main classes, as well as SOLR as a dependency; not a big deal though.
Alright, I'll skip dedup for NUTCH-1480 and see if I can send it in and work on NUTCH-1377.
Are you sure you cannot run a MapReduce program from within a plugin? I think it's worth trying
Tried, failed.
Re other issues: wouldn't it make sense to do NUTCH-1047 first, before you improve the SOLR backends?
Too bad.
I'm not sure, at least 1480 is ready, but fine by me. Too bad I'll have to rewrite the patches then.
Should not be a big deal, as the classes affected by NUTCH-1480 are not modified that much by NUTCH-1047, and it also means that you'll get to look at the code for this issue, which is a good way of reviewing it.
> which is a good way of reviewing it
Cheers! Looking forward to your new patch.
My suggestion was that you give NUTCH-1047 a try, wait until it is committed, and then commit your changes on top of it - not that I'd patch it to include your changes.
BTW have commented on NUTCH-1480
thanks
Julien
First working patch!
Added the SOLRDedup back into the core classes as it does not seem to be possible to run a MapReduce class from within a plugin.
Added 2 new methods to the IndexWriter interface (commit, update) + fixed CleaningJob and nutch script.
Tried on a small crawl with the crawl script and it worked as expected
Excellent work my friend! I'll be sure to test this next week! Hopefully it all works out fine and I can rewrite the other indexing patches with ease.
Cheers!
Hi, I applied the patch, but I could not find how to set the solr URL, and the class SolrUtils is duplicated in two places. Maybe the DeleteDuplicates will be made pluggable in backends later too.
Hi Lufeng.
The solrindex command in the nutch script works just as before. You can also invoke the IndexingJob command and pass it the SOLR URL as a Hadoop parameter e.g. -D solr.server.url=xxxxxx
SolrUtils is duplicated indeed because of DeleteDuplicates, which is a SOLR-specific implementation. We need to build a generic deduplicator at some point and it will use the pluggable backends. I decided to leave the SOLR-based one in for now, but if most people don't use it then we should probably shelve it. This is a separate issue though.
Thanks for your comments
Hi Julien, it will be early next week until I can try this patch out. There are numerous hurdles to get over regarding the network security here and I do not quite know the configuration yet. It's top of my Jira TODO though.
Hi Julien,
I am trying out the patch and facing an issue. Maybe I am using it the wrong way. Here is what I did:
After setting up Nutch + Solr and changing schema.xml as per the wiki, I applied the patch. If I don't pass the -D option in the crawl command, it throws an exception indicating "Missing SOLR URL". I believe the -solr option along with the URL also needs to be provided, else it won't perform the indexing part. To run a test crawl, I use this command:
bin/nutch crawl -D solr.server.url=http://localhost:8983/solr/ urls -solr http://localhost:8983/solr/ -depth 5 -topN 5000
It gives me an exception saying: "ERROR: [doc=http://searchhub.org/2009/03/09/nutch-solr/] unknown field 'content'". I have no clue about this. Can you kindly point out where I went wrong?
Also, the crawl command above needs the solr url to be specified twice. Is there a way to run it with the solr url specified just once?
Hi Tejas
Maybe you don't need to add the -D option with the bin/nutch crawl command; both it and the -solr option are used to set the solr.server.url parameter. And the cause of the "unknown field 'content'" error is probably that the Solr schema.xml is not configured correctly. Did you copy the conf/schema.xml from the Nutch conf directory to the example/solr/conf directory?
Hi Julien,
I found that in bin/nutch there is a line like this: CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1". But I don't know why we don't add an option to set the indexer URL, such as bin/nutch solrindex -indexurl http://localhost:8983/solr/.
But now I found that the correct command to invoke the IndexingJob is "bin/nutch solrindex http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter".
Hi Lufeng,
You are right. There was a problem with my schema.xml file. I corrected it and now things are working. Thanks!
@tejasp I can reproduce the issue and am looking into it. Somehow the configuration does not get passed on properly when using the crawl command. Thanks.
Lufeng
> But I don't know why we don't add an option to set the indexer URL, such as bin/nutch solrindex -indexurl http://localhost:8983/solr/.
Whether it is passed as a parameter or via the configuration should not make much of a difference. Your suggestion also assumes that the indexing backend can be reached via a single URL, which is not necessarily the case: it could need no URL at all or, at the opposite extreme, need multiple URLs. Better to leave that logic in the configuration and assume that the backends will find whatever they need there.
> the correct command to invoke the IndexingJob is "bin/nutch solrindex http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter".
As explained above, we want to keep compatibility with the existing solrindex command and not change its syntax. Underneath it uses the new code based on plugins but sets the value of the solr config. There is no shortcut for the generic indexing job command in the nutch script yet, but we could add one. For now it has to be called in full, e.g. bin/nutch org.apache.nutch.indexer.IndexingJob ..., which will make sense when we have other indexing backends and not just SOLR.
Think about 'nutch solrindex' as a shortcut for the generic command.
Hi Julien,
After the reply from @lufeng, I was able to perform indexing with the crawl command. Here is a summary of things I have observed:
"solr.server.url" in nutch-site.xml | "-D" in crawl command | Works ? |
---|---|---|
no | no | RuntimeException: Missing SOLR URL |
no | yes | yes |
yes | no | yes |
yes | yes | yes |
Note that I had to pass "-solr" and the solr url every time. Else it didn't invoke indexing.
Hi Tejas
It will work every time you set it in nutch-site.xml. As for setting it with -D in the crawl command - you definitely should not have to do that, and this is where the bug is. The problem is that, for some reason, the value we take from the crawl command is correctly set in the configuration object, however the latter is reloaded or overridden during the call to JobClient.runJob(job) (IndexingJob line 120).
BTW the crawl command is deprecated and should be removed at some point as we have the crawl script. Could you try using the SOLRIndex command as well as the crawl script while I try and solve the problem with the crawl command?
Thanks
Julien
Hi Julien, the solrindex command and crawl script work fine after setting "solr.server.url" in nutch-site.xml. I did not use the "-D" option during these runs.
Tejas
The crawl script and the solrindex command should work without setting "solr.server.url" in nutch-site.xml or using -D, as this is handled for you in the nutch script. Can you please test without specifying "solr.server.url" in nutch-site.xml?
Thanks
As a test for the interface I started to implement a CSV indexer - useful for exporting crawled data or for quick analysis. First working version (draft, still a lot to do) within 100+ lines of code: +1 for the interface / extension point.
Some concerns about the usability of IndexingJob as a "daily" tool:
- it's not really transparent which indexer is run (solr, elastic, etc.): you have to look into the plugin.includes property
- options must be passed to indexer plugins as properties: complicated, no help to get a list of available properties
Hi Julien,
As you suggested, I tried to run solrindex command without setting "solr.server.url" in nutch-site.xml or "-D".
Command used:
bin/nutch solrindex http://localhost:8983/solr mycrawl/crawldb/ mycrawl/segments/201301280439/
It says:
Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
The check for number of args is causing this. I corrected it locally and it worked fine after that.
As per the usage above, the user needs to provide just the crawldb and segment. But the user needs to pass the solr url, which is consumed by the bin/nutch script. The usage message should be changed to hide this mechanism from the user.
wastl-nagel, a text-based indexer is a good idea. Having one generating data in the format used by CloudSearch (see NUTCH-1517) would be cool as well. As for your concerns: most people currently use the SOLR indexer, which will still be the one activated by default. I expect a minority of people will try and use something else, and if they do then checking which one is activated is no big deal, either via the config file or from the logs. Passing the options via the config with -D is not very different from using a standard parameter, with the added benefit that it gives us the possibility to set things in nutch-site.xml once and for all, and hence make the commands much simpler. As for the list of properties, they would vary from backend to backend anyway. Each plugin could have a README describing what its options are; compared to having everything in nutch-default.xml, at least the descriptions will be contained within the related plugin.
tejasp good catch on the number of args, will fix it. Re usage message: we could add a getUsage() method to each backend that the generic command will call for all the active indexing plugins. I think the solrindex shortcut is just a temporary measure though, until the documentation is up to scratch and the user base has got used to the generic commands.
Thanks for taking the time to share your thoughts, guys.
Fixed a bug with the checking of argument length for the index command.
Fixed the issue with the solr param not being passed on when using the all-in-one crawl command.
Added a describe() method to IndexWriter which is called by the IndexingJob and dumps in the log a list of all the active index writers as well as the parameters that they take.
All the issues mentioned previously should now have been fixed. Basically the crawl and solrindex commands should work in exactly the same way as before, so no change from a user point of view, but we also get the possibility to plug in new backends.
Please give it a try, would be nice to commit that soon.
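To illustrate the describe() idea (an assumed shape, not necessarily the exact committed signature): each active writer returns a short description of itself and the properties it understands, and the generic job can print these as part of its usage output. The writer name and property names below are invented for the example.

```java
// Hypothetical writer showing what a describe() implementation could return.
public class DummyIndexWriterExample {
  public String describe() {
    return "DummyIndexWriterExample\n"
        + "\tdummy.output.path : where to write the dump (mandatory)\n"
        + "\tdummy.commit.size : number of docs per flush (default 250)\n";
  }
}
```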
The v5 patch works correctly in Nutch 1.6 with Solr 3.6 and 4.1. Note that the schema-solr4.xml configuration file for Solr 4.1 needs the patch from NUTCH-1486.
It would be better if the indexer could report progress.
Good job, thanks Julien.
Hi Julien,
The crawl command (with the solr option) and the solrindex command are working properly now. Is there anything else that you think should be verified?
Hi Tejas
Thank you for taking the time to have a look. The SolrClean command has been modified too to use the plugin architecture and that should be the last thing I think.
Thanks
Julien
Hey Julien,
While running the solrclean command, I followed the old usage given here [0]. It gave an exception. Then I saw the usage and it gave
$ bin/nutch solrclean
Usage: CleaningJob <crawldb> [-noCommit]
That did not work either. It just prints the usage if only the crawldb is passed as an argument. I went through the patch and realized that the bin/nutch script considers the first argument to be the solr url, and then the leftover, i.e. the crawldb, is passed to the Java code. This is what worked for me:
bin/nutch solrclean <solrurl> <crawldb>
This is different from the old usage given at [0]. We can avoid changing the ordering of the arguments and preserve the old usage. This can be used in the bin/nutch script:
CLASS="org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2"
and not perform a "shift" after that. Corresponding usage must be modified in the java code too.
Hey Julien, one question: why does this change not affect the "solrdedup" command (i.e. the SolrDeleteDuplicates class)?
Hi Tejas
Good catch, could do
CLASS="org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2 $1"
shift; shift
There is no change to make in the Java code as it expects only one argument, which is the crawldb. We could also get the CleaningJob to log which indexers are available.
Re solrdedup: the explanation is given earlier in this thread. It is a SOLR-specific approach and we can't run a job located in a plugin; the main job file has to be in the core code. We need a better deduplicator anyway.
Hi Julien,
One small change in the Java class would be to display this usage message to the user:
$ bin/nutch solrclean
Usage: CleaningJob <crawldb> <solrurl> [-noCommit]
The current patch doesn't display "solrurl" in the usage.
Tejas,
The CleaningJob is backend-neutral and as such should not expect <solrurl> as a parameter. Same as with the IndexingJob really
Hi Julien,
Overall, all looks good. A first version of the CSV indexer is ready (NUTCH-1541) and works well with the last v5 patch.
One point we should improve is the command-line help. I agree with Tejas that the help should list all required arguments. Of course, you are right that the index/cleaning jobs are "backend-neutral", but then it would be preferable to have new commands "index" and "indexclean". They are also required if other indexer back-ends are used. We can keep the "solr*" commands for legacy reasons and because they are handy. A few additional lines to generate the prior help text are tolerable and could avoid unnecessary user requests on the mailing list.
The describe() method is a good idea. The new commands will then show sufficient help but IndexingJob/CleaningJob should also call describe() when help is shown!
Some trivialities to get the Java docs right:
- default.properties - need to add the new "plugins.indexer" group with indexer-solr as member
- build.xml - add group referring to "plugins.indexer", add Java doc targets for indexer-solr
Final patch for the records before committing. Have added generic 'index' and 'clean' commands, which call the describe() method of the IndexWriters as part of the usage message.
Have added the Javadoc as suggested by Seb as well as the fix for SOLRClean from Tejas
Committed revision 1453776.
Thanks everyone for the comments and reviews. Let's add some indexing backends now
Integrated in Nutch-trunk-Windows #57 (See https://builds.apache.org/job/Nutch-trunk-Windows/57/)
NUTCH-1047 Pluggable indexing backends (Revision 1453776)
Result = FAILURE
jnioche : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1453776
Files :
- /nutch/trunk/CHANGES.txt
- /nutch/trunk/build.xml
- /nutch/trunk/conf/nutch-default.xml
- /nutch/trunk/default.properties
- /nutch/trunk/ivy/ivy.xml
- /nutch/trunk/src/bin/nutch
- /nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/CleaningJob.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriter.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriters.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrClean.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
- /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
- /nutch/trunk/src/plugin/build.xml
- /nutch/trunk/src/plugin/indexer-solr
- /nutch/trunk/src/plugin/indexer-solr/build.xml
- /nutch/trunk/src/plugin/indexer-solr/ivy.xml
- /nutch/trunk/src/plugin/indexer-solr/plugin.xml
- /nutch/trunk/src/plugin/indexer-solr/src
- /nutch/trunk/src/plugin/indexer-solr/src/java
- /nutch/trunk/src/plugin/indexer-solr/src/java/org
- /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache
- /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch
- /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter
- /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr
- /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
- /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
- /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
- /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
- /nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
Integrated in Nutch-trunk #2144 (See https://builds.apache.org/job/Nutch-trunk/2144/)
NUTCH-1047 Pluggable indexing backends (Revision 1453776)
Result = SUCCESS
jnioche : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1453776
Files : same list as in the Nutch-trunk-Windows #57 build above
I think the suggestion of generic/example/template map/reduce jobs would be an excellent addition. This is a great idea. In my opinion it would reduce the barrier to entry for users inexperienced in setting up jobs.