Doğacan Güney made changes - 31/Jul/07 01:19 PM
This simple schema can be used to test solr integration.
Doğacan Güney made changes - 31/Jul/07 02:01 PM
Btw, after this patch it is possible to mix different types of servers in a search application. For example, a(ny number of) Solr search server(s), an old index/summary server, an index server and a summary server can work side by side in a single application.
New version.
IndexSearcher: Remove deprecated FSDirectory API usage. Remove deprecated FileSystem API usage.
Doğacan Güney made changes - 06/Aug/07 08:45 AM
Oops, attached wrong file. Here is the correct one.
Doğacan Güney made changes - 06/Aug/07 08:47 AM
Doğacan Güney made changes - 06/Aug/07 08:48 AM
Using Nutch with Solr has been a much-requested feature, so it will be very useful when this makes it into trunk. I have spent some time reviewing the patch, which I find quite elegant.
Some improvements to the patch would be
As far as I can see, we do not need any metadata for the Solr backend, and only need the Store, Index and Vector options for the Lucene backend, so I think we can simplify NutchDocument#metadata. We may implement:
class FieldMeta {
o.a.l.document.Field.Store store;
o.a.l.document.Field.Index index;
o.a.l.document.Field.TermVector tv;
}
FieldMeta[] IndexingFilter.getFields();
class NutchDocument {
...
private ArrayList<Field> fieldMeta;
...
}
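As a rough illustration of the FieldMeta idea above, here is a minimal, self-contained sketch. It is not code from the patch: the enums are hypothetical stand-ins for Lucene's Field.Store, Field.Index and Field.TermVector (so the sketch compiles without Lucene), and the `name` member, `addMeta`, `metaFor`, `luceneView` and `solrView` are names assumed purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FieldMetaSketch {
    // Hypothetical stand-ins for Lucene's Field.Store / Field.Index / Field.TermVector,
    // declared here only so the sketch compiles without Lucene on the classpath.
    enum Store { YES, NO }
    enum Index { NO, TOKENIZED, UN_TOKENIZED }
    enum TermVector { NO, YES }

    /** Per-field indexing options; the "name" member is an assumption of this sketch. */
    static final class FieldMeta {
        final String name;
        final Store store;
        final Index index;
        final TermVector tv;

        FieldMeta(String name, Store store, Index index, TermVector tv) {
            this.name = name;
            this.store = store;
            this.index = index;
            this.tv = tv;
        }
    }

    /** Simplified document: field values plus the metadata declared for them. */
    static final class NutchDocument {
        private final List<String[]> fields = new ArrayList<>();      // {name, value} pairs
        private final List<FieldMeta> fieldMeta = new ArrayList<>();

        void addMeta(FieldMeta meta) { fieldMeta.add(meta); }

        void add(String name, String value) { fields.add(new String[] {name, value}); }

        /** Returns the metadata declared for a field, or null if none was declared. */
        FieldMeta metaFor(String name) {
            for (FieldMeta m : fieldMeta) {
                if (m.name.equals(name)) {
                    return m;
                }
            }
            return null;
        }
    }

    /** A Lucene-style writer would consult the metadata for every field. */
    static List<String> luceneView(NutchDocument doc) {
        List<String> out = new ArrayList<>();
        for (String[] f : doc.fields) {
            FieldMeta m = doc.metaFor(f[0]);
            String opts = (m == null) ? "defaults" : m.store + "/" + m.index + "/" + m.tv;
            out.add(f[0] + "=" + f[1] + " [" + opts + "]");
        }
        return out;
    }

    /** A Solr-style writer can ignore the metadata and leave it to the Solr schema. */
    static List<String> solrView(NutchDocument doc) {
        List<String> out = new ArrayList<>();
        for (String[] f : doc.fields) {
            out.add(f[0] + "=" + f[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        NutchDocument doc = new NutchDocument();
        doc.addMeta(new FieldMeta("title", Store.YES, Index.TOKENIZED, TermVector.NO));
        doc.add("title", "Nutch");
        System.out.println(luceneView(doc));
        System.out.println(solrView(doc));
    }
}
```

The point of the sketch is only that the same document can feed both kinds of writer: one reads the per-field options, the other skips them.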
Or alternatively we may wish to keep add methods of NutchDocument compatible with o.a.l.document.Document, keeping the metadata up-to-date as we add new fields, using this info at LuceneWriter, but ignoring in SolrWriter. This will be slightly slower but the API will be much more intuitive. Due to the method signature bug (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6267833
> make NutchDocument implement VersionedWritable instead of writable, and delegate version checking to superclass
I never quite understood how VersionedWritable is supposed to work. It only checks for a version then throws an exception. If you want your class to behave differently you have to write custom code anyway.
> refactor getDetails() methods in HitDetailer to Searcher (it is not likely that a class would implement Searcher but not HitDetailer)
I agree with these three but I would like to get feedback from others before making this change.
> refactor LuceneSearchBean.VERSION to RPCSearchBean
We can't do this because there may be other RPCSearchBean-s besides LuceneSearchBean. So VERSION has to be redefined in every class that implements RPCSearchBean.
> remove unrelated changes from the patch.(the changes in NGramProfile, HTMLLanguageParser,LanguageIdentifier,... correct me if i'm wrong)
Correct... I will send an updated patch (that also includes Java 5 ExecutorService fixes)
> As far as i can see, we do not need any metadata for Solr backend, and only need Store,Index and Vector options for lucene backend, so i think we can simplify NutchDocument#metadata. We may implement : [...]
I really don't want to make NutchDocument depend on Lucene. So I would prefer that FieldMeta doesn't depend on Lucene data structures, because it is possible to implement a non-Lucene backend (say, you might want to index to a database).
Doğacan - your comments sound good and I'd guess "bean" piece should be stripped.
Do you have a new version of the patch that applies to the trunk? I'd love to see this in soon, as I will need the Nutch+Solr combination in a few weeks (early December)... yeah, this would be a lovely Christmas present!
Hi Otis,
I have a new version of the patch for trunk (I have also made some other changes), but it is not fully tested yet. Still, I am going to post it soon so others can take a look at it.
Here is the latest patch. I can tell you that it compiles but I don't know if it runs or not
I don't have time to describe all the new things in the patch, but here is what my git-log command tells me: (Edit: I should add: I will describe changes properly later, I just don't have the time right now
Adds a static stringify(BooleanQuery) method to SolrSearchBean. We can't just call
Added a new monitoring thread to FetchedSegments that watch
Doğacan Güney made changes - 19/Nov/07 09:12 PM
Hi Doğacan,
Is it safe to apply this patch to Nutch/Hadoop running on multiple machines (Nutch master/slave with >1 nodes)?
Tomislav, I am not sure what you mean. I wouldn't yet put this patch in a production environment because it may blow up your computers, etc., but this patch is designed to work in any configuration. So if yours is a development environment then this patch should work fine in your configuration.
Doğacan – can you please explain what you mean by "blow up your computers"?
I didn't look at your latest patch yet, but I thought the Nutch side of the patch was read-only... no?
"blow up your computers" may be a bit of an exaggeration
What I mean is that I have tested this patch myself, but it is a big and intrusive one so it can mess with your Solr index. I am not sure what you mean by "read-only" but a large portion of this patch deals with indexing to Solr. So it is not read-only in that sense.
Doğacan - ah, good!
The Nutch side of the functionality included in this patch is "read-only". The Solr side is the only side where writes happen. I'm a lot happier to hear that the blowing up is really just messing with Solr/Lucene indices - that is a lot easier to deal with than something going wrong on the Nutch/Hadoop side, at least in my book. Thanks for the clarification! This issue has a lot of votes and a lot of watchers. I'm ready to try this out with a good-sized Nutch+Solr installation and so will be able to see if anything really blows up or if this patch is good to commit.
The patch currently fails. Doğacan, do you happen to have an up-to-date version?
Hi,
Sorry I have been out of the loop for a long time. I am working on bringing the patch up to date with trunk. Should have one ready in a few days.
Here is a patch that should apply against latest trunk. I tested it with a small crawl and it seems to work.
There are two small issues: Unit tests do not compile yet (TestDistributedSearch changed in an incompatible way, I will fix this in a follow-up patch), also I had to pull the patch that reloads search-servers.txt file if it is changed (I think it can be applied on top of this patch, so again follow-up patch
Doğacan Güney made changes - 17/Apr/08 02:14 PM
Hey everyone,
I guess I shouldn't be saying this since I have been gone such a long time, but can I get some reviews/testing/suggestion on this patch? I know it is huge (sorry about that
Hi,
Tried with nutch svn rev: 650750 and solr svn rev: 652571 - has been working perfectly for around a week in dev env.
Attaching a trivial change for http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java that just adds the -solr argument.
thanks,
Caspar MacRae made changes - 12/May/08 08:19 PM
Forgot to add: Indexer wasn't cleaning up its tmp working directory (also appended a random number to the tmp dir name, following the style of CrawlDbMerger etc.)
Caspar MacRae made changes - 14/May/08 02:04 AM
Hi Caspar,
Thanks for the patches and testing! I will merge your changes in a new patch. For future reference, can you send your patches as diffs (preferably against the latest patch) instead of sending the whole file?
Hello.
I'm trying to apply this patch and faced a problem that I cannot solve by myself. I checked out nutch trunk (rev 670194), downloaded attachments from this issue and started patching. Crawl.patch performs the following action: and
The main difference between these patches is in the second parameter. What can cause that problem and how can I fix it to make nutch index into solr? Thanks.
Merged patches for Indexer and Crawler with 442_v5
Julien Nioche made changes - 22/Jun/08 03:24 PM
James Tan made changes - 30/Jul/08 07:51 AM
James Tan made changes - 30/Jul/08 07:52 AM
Here is an updated patch synced with SVN trunk.
A few thoughts and comments:
Thanks for your work.
Guillaume Smet made changes - 05/Aug/08 07:57 AM
Took the liberty of converting the absolute filenames to relative names so you can apply the patch from the Nutch base install directory. Otherwise no changes.
Nick Tkach made changes - 19/Sep/08 03:53 PM
What needs to be done for this to make it into the trunk? I mean, any additional development or testing? Maybe I can be of use.
Thanks to everyone for comments. Unfortunately this patch will probably have to wait until after 1.0 to get in.
But since many people are interested in having some sort of Solr integration in trunk, maybe we can update Sami Siren's solr patch and commit it for 1.0. What do others think?
I personally believe this patch should be in before 1.0, since it does not make sense to make such a change in 1.1. However, since there is some need to test this patch more thoroughly, I guess we can make a branch and commit it there, so that people can test this easily. However, branching has its own problems; especially keeping in sync with trunk would get harder and harder.
Since this issue has a large number of votes and watchers, I suggest we branch and commit it, test this out a little bit more, and merge to trunk before 1.0.
+1 on adding this before 1.0 - it would be a shame to miss this functionality when it's been asked for over and over. One change that should be made (either in this patch or as a follow-up) is to use SolrJ instead of plain HTTP.
I don't think we need to branch for this - as long as the patch passes tests and runs basic commands IMHO it's good enough to expose a wider audience to it. Applying this to trunk/ actually gives us better chances that it will be tested by more people.
Great!
(I am obviously +1 on adding this before 1.0.) So, can I get some reviews on what people think of this patch then? On solrj: I will send an updated patch that uses solrj instead.
Doğacan Güney made changes - 09/Oct/08 11:56 AM
Doğacan Güney made changes - 09/Oct/08 11:57 AM
Latest patch.
Doğacan Güney made changes - 09/Oct/08 12:00 PM
Hi Dogacan,
I have tested your patch and did not have any problems with it. Any chance it could get included in the trunk soon? Thanks! This is an awesome improvement for Nutch.
Julien
Hi,
really great! I use the NUTCH/SOLR integration patches for a theme-specific search engine and I could never have reached this goal without the excellent work. But since the latest patch (v.8), I am not able to put the fetched documents into solr. The build is OK, apache-solr-solrj and apache-solr-common are in /lib, so everything should be fine so far. Do I have to configure solrj or to add something like "nutch index -solr http:...." like in patch v.7? I got this error:
Thanks for help
Hi Felix,
I have changed the latest patch a bit, that is why you are getting an error. Specifically, I have separated the solr and lucene indexing mechanisms, so now you should run:
bin/nutch org.apache.nutch.indexer.solr.SolrIndexer <solr url> <crawldb> <linkdb> <segment> .....
Of course, before committing we can add a shorthand to the nutch script so you can run:
bin/nutch indexsolr ....
Hi everybody,
1. In SOLR, the field "cache" is empty. If the NUTCH/SOLR integration does not provide this, how can I put the full (and not parsed) html content into my SOLR database? I use patch v.8.
2. Is it possible to index a single character, especially "§" (paragraph sign), with SOLR? Is it only a SOLR thing or do I additionally have to change something in the NUTCH parser?
Thanks for help
In the class SOLRIndexer.java, should we specify:
job.setReduceSpeculativeExecution(false);
around line 48? Otherwise we might have several attempts for a reduce task running at the same time and sending the same documents to the same SOLR instance, which is likely to slow down the indexing. Speculative execution does not make the indexing safer, as these attempts would all crash in the same way if they receive a SOLRException.
Hello everyone!
I am trying to integrate Nutch with Solr by applying the
The text below shown in red is my input on the SSH client window:
I've just downloaded
webby88 /opt/tomcat6/webapps/nutch: patch <
The next patch would delete the file TestDistributedSearch.java,
Too many similar 'file cannot be found' errors here, so errors cut off.
File to patch:
When I tried to run 'ant war' in the nutch installation directory, I got this error:
BUILD FAILED
I wonder if my way of applying this patch is correct or not. Could you please give me some correction if I did wrong? My system is CentOS 5.2 by the way.
I will be out of the office until January 26th.
Committed in rev. 733738.
Besides syncing with trunk, I added a "solrindex" command to bin/nutch. Also, since we don't have an indexDocNo anymore, response-xml and response-json are altered slightly to return "indexkey" instead. Functionally for lucene indexes, this should be the same as indexDocNo.
Doğacan Güney made changes - 12/Jan/09 01:27 PM
To julien nioche:
I missed your comment about speculative execution. I think it is a good idea and I will add it as a separate commit.
Aaand one more thing:
I just deleted TestDistributedSearch for now as distributed search changed too much. For example, there is no DistributedSearch.Client anymore. Later, we can figure out how to add it back on.
Integrated in Nutch-trunk #691 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/691/)
Two more changes:
closing issues for released version
Sami Siren made changes - 27/Mar/09 08:28 PM
Patch consists of two parts:
1) Support for multiple indexing backends: See NUTCH-520 for details. This patch includes the latest patch from NUTCH-520.
2) Support for multiple search backends:
This division may seem arbitrary (and it actually is), however these abstractions are useful enough that Solr and nutch's search server can work. If later further abstractions are needed for new search backends, they can be added.
This division also has a nice side effect: Currently, a search server searches lucene indexes and generates summaries for results. After this patch, it is now possible to start a search server that searches an index and a 'segment server' (that returns cached content of pages, generates summaries, etc.) separately. DistributedSearch$IndexServer (uses LuceneSearchBean) and DistributedSearch$SegmentServer (uses FetchedSegments) classes are added for this.
SearchBean (extends Searcher, HitDetailer)
RPCSearchBean (extends SearchBean, VersionedProtocol)
LuceneSearchBean (implements RPCSearchBean, searches lucene indexes (may be local or on dfs), can also respond to RPC requests)
SolrSearchBean (implements SearchBean, processes responses from a SOLR server)
DistributedSearchBean (implements SearchBean, is also a container of SearchBeans. This class implements the searching part of DistributedSearch$Client. Sends parallel requests to multiple beans and merges their results. Does not use the RPC.call API (since not all beans support hadoop's RPC); instead uses a thread pool for parallel requests.)
SegmentBean (extends HitContents, HitSummarizer)
RPCSegmentBean (extends SegmentBean, VersionedProtocol)
FetchedSegments (is similar to older version)
Sorry if the description is a bit complex (however, the code itself should be easy to understand). Comments, suggestions, reviews and all other sorts of feedback are welcome.
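To make the DistributedSearchBean behaviour described above a little more concrete, here is a minimal sketch of fanning a query out to several beans through a thread pool and merging the results by score. It is an illustration under stated assumptions, not the code from the patch: `SimpleSearchBean`, `Hit` and `searchAll` are hypothetical, simplified names, and the real Searcher/Hits types carry considerably more information.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FanOutSearchSketch {
    /** Hypothetical, simplified hit: just a key and a score. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for a search bean (Lucene, Solr, remote, ...). */
    interface SimpleSearchBean {
        List<Hit> search(String query, int numHits) throws Exception;
    }

    /** Queries every bean in parallel, then merges and keeps the top numHits by score. */
    static List<Hit> searchAll(List<SimpleSearchBean> beans, final String query, final int numHits)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, beans.size()));
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (final SimpleSearchBean bean : beans) {
                tasks.add(() -> bean.search(query, numHits));
            }
            List<Hit> merged = new ArrayList<>();
            // invokeAll blocks until every bean has answered (or failed).
            for (Future<List<Hit>> result : pool.invokeAll(tasks)) {
                merged.addAll(result.get());
            }
            // Highest score first.
            merged.sort((a, b) -> Float.compare(b.score, a.score));
            return new ArrayList<>(merged.subList(0, Math.min(numHits, merged.size())));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Two fake backends standing in for a Lucene bean and a Solr bean.
        SimpleSearchBean lucene = (q, n) -> Arrays.asList(new Hit("lucene-1", 0.9f), new Hit("lucene-2", 0.4f));
        SimpleSearchBean solr = (q, n) -> Arrays.asList(new Hit("solr-1", 0.7f));
        for (Hit h : searchAll(Arrays.asList(lucene, solr), "nutch", 2)) {
            System.out.println(h.key + " " + h.score);
        }
    }
}
```

Because the fan-out goes through a plain interface rather than an RPC call, beans with different transports can sit side by side in the same list. A failure in any one bean surfaces here as an ExecutionException; a real implementation would probably want to tolerate a failing backend instead.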