Key: NUTCH-442
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Doğacan Güney
Reporter: rubdabadub
Votes: 17
Watchers: 20
Nutch

Integrate Solr/Nutch

Created: 07/Feb/07 06:37 PM   Updated: 10/Apr/09 12:29 PM
Component/s: indexer, searcher
Affects Version/s: None
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works Crawl.patch 2008-05-12 08:19 PM Caspar MacRae 3 kB
Text File Licensed for inclusion in ASF works Indexer.patch 2008-05-14 02:04 AM Caspar MacRae 15 kB
Text File Licensed for inclusion in ASF works NUTCH-442_v4.patch 2007-11-19 09:12 PM Doğacan Güney 182 kB
Text File Licensed for inclusion in ASF works NUTCH-442_v5.patch 2008-04-17 02:14 PM Doğacan Güney 178 kB
Text File Licensed for inclusion in ASF works NUTCH-442_v6.patch.txt 2008-06-22 03:24 PM Julien Nioche 181 kB
Text File Licensed for inclusion in ASF works NUTCH-442_v7.patch.txt 2008-08-05 07:57 AM Guillaume Smet 188 kB
Text File Licensed for inclusion in ASF works NUTCH-442_v7a.patch.txt 2008-09-19 03:53 PM Nick Tkach 183 kB
Text File Licensed for inclusion in ASF works NUTCH-442_v8.patch 2008-10-09 12:00 PM Doğacan Güney 192 kB
Text File Licensed for inclusion in ASF works NUTCH_442_v3.patch 2007-08-06 08:47 AM Doğacan Güney 196 kB
Text File Licensed for inclusion in ASF works RFC_multiple_search_backends.patch 2007-07-31 01:19 PM Doğacan Güney 158 kB
XML File Licensed for inclusion in ASF works schema.xml 2007-07-31 02:01 PM Doğacan Güney 2 kB
Environment: Ubuntu linux

Resolution Date: 12/Jan/09 01:27 PM


Description
Hi:

I tried Sami's patch for Solr/Nutch integration, which can be found here (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html), and I can confirm that it worked. That led me to request the following:

I would be very grateful if this could be included in Nutch 0.9, as I am trying to eliminate my Python-based crawler, which posts documents to Solr. Since I am in a corporate environment, I can't install the trunk version in production, so I am asking for this to be included in the 0.9 release. I hope my wish will be granted.

I look forward to getting some feedback.

Thank you.



Doğacan Güney added a comment - 31/Jul/07 01:19 PM
Here is my (very large - sorry) patch for this issue:

Patch consists of two parts:

1) Support for multiple indexing backends: See NUTCH-520 for details. This patch includes the latest patch from NUTCH-520.

2) Support for multiple search backends:

  • DistributedSearch.Client is removed.
  • Search is divided into three main parts:
  • SearchBean: implements Searcher and HitDetailer
  • SegmentBean: implements HitContent and HitSummarizer
  • HitInlinks: same as old

This division may seem arbitrary (and it actually is), however these abstractions are useful enough that Solr and nutch's search server can work. If later further abstractions are needed for new search backends, they can be added.

This division also has a nice side effect: currently, a search server searches lucene indexes and generates summaries for results. After this patch, it is possible to start a search server that searches an index and a 'segment server' (that returns cached content of pages, generates summaries, etc.) separately. DistributedSearch$IndexServer (uses LuceneSearchBean) and DistributedSearch$SegmentServer (uses FetchedSegments) classes are added for this.

  • SearchBean hierarchy is like this:

SearchBean (extends Searcher, HitDetailer)

RPCSearchBean (extends SearchBean, VersionedProtocol)

LuceneSearchBean (implements RPCSearchBean, searches lucene indexes (may be local or on dfs), can also respond to RPC requests)

SolrSearchBean (implements SearchBean, processes responses from a SOLR server)

DistributedSearchBean (implements SearchBean, is also a container of SearchBeans. This class implements the searching part of DistributedSearch$Client. Sends requests in parallel to multiple beans and merges their results. Does not use the RPC.call API (since not all beans support hadoop's RPC); instead it uses a modern thread pool for parallel requests.)

  • Locations of remote nutch/lucene servers are still read from crawl/search-servers.txt. Locations of solr servers are read from crawl/solr-servers.txt (yes, it supports searching more than one solr server).
  • DistributedSearchBean routinely sends pings to its beans. If a bean fails to respond, it is removed from active list of search servers (so that it doesn't block searching). For example, if solr server dies, DistributedSearchBean realizes this and stops sending search requests to solr server. Later when solr comes back up, DistributedSearchBean re-adds it to active search server list.
  • SegmentBean is similar:

SegmentBean (extends HitContents, HitSummarizer)

RPCSegmentBean (extends SegmentBean, VersionedProtocol),

FetchedSegments (is similar to older version)

  • DistributedSearch$SegmentServer (which uses FetchedSegments internally) reads its config from crawl/segment-servers.txt .
  • I also added a couple of utility classes for sending requests to solr and processing responses (under o.a.n.util.solr)
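
The ping behaviour described above (drop a bean from the active list when it stops answering, re-add it when it answers again) can be sketched roughly as follows. This is only an illustration, not code from the patch: `Pingable` and `ActiveBeans` are hypothetical names, and the sketch assumes that any exception from a ping counts as a failure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a remote search server that can be pinged. */
interface Pingable {
  boolean ping() throws Exception;
}

/** Hypothetical bookkeeping for which servers currently receive searches. */
final class ActiveBeans {
  private final List<Pingable> all;
  private final Set<Pingable> active = ConcurrentHashMap.newKeySet();

  ActiveBeans(List<Pingable> all) {
    this.all = new ArrayList<>(all);
    this.active.addAll(all); // assume every server is up until a ping says otherwise
  }

  /** Pings every known server once; meant to be called periodically. */
  void pingAll() {
    for (Pingable bean : all) {
      boolean alive;
      try {
        alive = bean.ping();
      } catch (Exception e) {
        alive = false; // assumption: an error counts as a failed ping
      }
      if (alive) {
        active.add(bean);    // re-adds a server that has come back up
      } else {
        active.remove(bean); // a dead server no longer blocks searching
      }
    }
  }

  /** Servers that answered their most recent ping, in configuration order. */
  List<Pingable> activeBeans() {
    List<Pingable> out = new ArrayList<>();
    for (Pingable bean : all) {
      if (active.contains(bean)) {
        out.add(bean);
      }
    }
    return out;
  }
}
```

In a real deployment `pingAll()` would presumably run on a timer (for example via a `ScheduledExecutorService`), while searches only ever iterate over `activeBeans()`.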

Sorry if the description is a bit complex (the code itself should be easy to understand, however). Comments, suggestions, reviews and all other sorts of feedback are welcome.


Doğacan Güney made changes - 31/Jul/07 01:19 PM
Field Original Value New Value
Attachment RFC_multiple_search_backends.patch [ 12362875 ]
Doğacan Güney added a comment - 31/Jul/07 02:01 PM
This simple schema can be used to test solr integration.

Doğacan Güney made changes - 31/Jul/07 02:01 PM
Attachment schema.xml [ 12362879 ]
Doğacan Güney added a comment - 02/Aug/07 11:54 AM
Btw, after this patch it is possible to mix different types of servers in a search application. For example, a(ny number of) solr search server(s), an old index/summary server, an index server and a summary server can work side-by-side in a single application.

Doğacan Güney added a comment - 06/Aug/07 08:45 AM
New version.
  • Properly reset internal state of SolrResponseHandler between parses.
  • Remove unused imports in ScoringFilters.
  • Reduce number of warnings in o.a.n.searcher.

IndexSearcher: Remove deprecated FSDirectory API usage. Remove deprecated FileSystem API usage.
LuceneQueryOptimizer: Remove usage of lucene's QueryFilter.
Everything: Suppress serial UID warnings and use generics.

  • Better handling of Hit results in DistributedSearchBean.

Doğacan Güney made changes - 06/Aug/07 08:45 AM
Attachment NUTCH_442_v2.patch [ 12363223 ]
Doğacan Güney added a comment - 06/Aug/07 08:47 AM
Oops, attached wrong file. Here is the correct one.

Doğacan Güney made changes - 06/Aug/07 08:47 AM
Attachment NUTCH_442_v3.patch [ 12363224 ]
Doğacan Güney made changes - 06/Aug/07 08:48 AM
Attachment NUTCH_442_v2.patch [ 12363223 ]
Enis Soztutar added a comment - 15/Oct/07 03:33 PM
Using nutch with solr has been a much-requested feature, so it will be very useful when this makes it into trunk. I have spent some time reviewing the patch, which I find quite elegant.

Some improvements to the patch would be

  • make NutchDocument implement VersionedWritable instead of writable, and delegate version checking to superclass
  • refactor getDetails() methods in HitDetailer to Searcher (it is not likely that a class would implement Searcher but not HitDetailer)
  • use Searcher, delete HitDetailer and SearchBean
  • Rename XXXBean classes so that they do not include "bean". (I think it is confusing to have bean objects that have non-trivial functionality)
  • refactor LuceneSearchBean.VERSION to RPCSearchBean
  • remove unrelated changes from the patch.(the changes in NGramProfile, HTMLLanguageParser,LanguageIdentifier,... correct me if i'm wrong)

As far as i can see, we do not need any metadata for Solr backend, and only need Store,Index and Vector options for lucene backend, so i think we can simplify NutchDocument#metadata. We may implement :

class FieldMeta {
o.a.l.document.Field.Store store;
o.a.l.document.Field.Index index;
o.a.l.document.Field.TermVector tv;
}

FieldMeta[] IndexingFilter.getFields();

class NutchDocument {
...
private ArrayList<Field> fieldMeta;
...
}

Or alternatively we may wish to keep add methods of NutchDocument compatible with o.a.l.document.Document, keeping the metadata up-to-date as we add new fields, using this info at LuceneWriter, but ignoring in SolrWriter. This will be slightly slower but the API will be much more intuitive.


Enis Soztutar added a comment - 26/Oct/07 01:20 PM
Due to the method signature bug (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6267833) for ExecutorService#invokeAll the patch will not compile against 1.5. We should manage the lists as List<Callable<T>>.
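
To illustrate the suggestion: as far as I understand the linked bug, Java 5 declared the parameter of `invokeAll` as `Collection<Callable<T>>`, and later releases relaxed it to `Collection<? extends Callable<T>>`. A list declared with exactly the element type `Callable<T>` should therefore be accepted by both. The sketch below uses a hypothetical class `InvokeAllTyping` and sticks to Java 5 syntax (no `@Override` on interface methods, no diamond operator).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class InvokeAllTyping {
  /** Runs one task per word in parallel and returns the word lengths in input order. */
  static List<Integer> lengths(List<String> words) {
    // Element type is exactly Callable<Integer>, not a subclass or a wildcard,
    // so the list matches the strict Collection<Callable<T>> parameter too.
    List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
    for (final String word : words) {
      tasks.add(new Callable<Integer>() {
        public Integer call() {
          return Integer.valueOf(word.length());
        }
      });
    }
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      List<Integer> out = new ArrayList<Integer>();
      // invokeAll returns the futures in the same order as the submitted tasks.
      for (Future<Integer> f : pool.invokeAll(tasks)) {
        out.add(f.get());
      }
      return out;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }
}
```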

Doğacan Güney added a comment - 01/Nov/07 01:45 PM
> make NutchDocument implement VersionedWritable instead of writable, and delegate version checking to superclass

I never quite understood how VersionedWritable is supposed to work. It only checks for a version then throws an exception. If you want your class to behave differently you have to write custom code anyway.

> refactor getDetails() methods in HitDetailer to Searcher (it is not likely that a class would implement Searcher but not HitDetailer)
> use Searcher, delete HitDetailer and SearchBean
> Rename XXXBean classes so that they do not include "bean". (I think it is confusing to have bean objects that have non-trivial functionality)

I agree with these three but I would like to get feedback from others before making this change.

> refactor LuceneSearchBean.VERSION to RPCSearchBean

We can't do this because there may be other RPCSearchBean-s besides LuceneSearchBean. So VERSION has to be redefined in every class that implements RPCSearchBean.

> remove unrelated changes from the patch.(the changes in NGramProfile, HTMLLanguageParser,LanguageIdentifier,... correct me if i'm wrong)

Correct... I will send an updated patch (that also includes Java 5 ExecutorService fixes)

> As far as i can see, we do not need any metadata for Solr backend, and only need Store,Index and Vector options for lucene backend, so i think we can simplify NutchDocument#metadata. We may implement : [...]

I really don't want to make NutchDocument depend on Lucene. So I would prefer that FieldMeta doesn't depend on Lucene data structures. Because it is possible to implement a non-lucene backend (say, you might want to index to a database)
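
One way to picture a Lucene-free FieldMeta is to describe the options with plain enums that each backend interprets itself. The sketch below is hypothetical: `FieldOptions`, its enum constants and `NeutralDocument` are illustrative names, not classes from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical backend-neutral field options; no Lucene types are referenced. */
final class FieldOptions {
  enum Store { YES, NO }
  enum Index { NO, TOKENIZED, UNTOKENIZED }

  final Store store;
  final Index index;

  FieldOptions(Store store, Index index) {
    this.store = store;
    this.index = index;
  }
}

/** Hypothetical document that records the options of each field as it is added. */
final class NeutralDocument {
  private final Map<String, String> values = new LinkedHashMap<>();
  private final Map<String, FieldOptions> options = new LinkedHashMap<>();

  void add(String name, String value, FieldOptions opts) {
    values.put(name, value);
    options.put(name, opts);
  }

  String get(String name) {
    return values.get(name);
  }

  FieldOptions optionsFor(String name) {
    return options.get(name);
  }
}
```

Under this assumption a Lucene writer would translate `FieldOptions` into Lucene's own constants at write time, while a Solr or database writer could ignore them.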


Otis Gospodnetic added a comment - 18/Nov/07 10:28 PM
Doğacan - your comments sound good and I'd guess "bean" piece should be stripped.
Do you have a new version of the patch that applies to the trunk? I'd love to see this in soon, as I will need the Nutch+Solr combination in a few weeks (early December)....yeah, this would be a lovely Christmas present!

Doğacan Güney added a comment - 19/Nov/07 09:03 PM
Hi Otis,

I have a new version of patch for trunk(I have also made some other changes), but it is not fully tested yet. Still I am going to post it soon so others can take a look at it.


Doğacan Güney added a comment - 19/Nov/07 09:12 PM - edited
Here is the latest patch. I can tell you that it compiles, but I don't know whether it runs.

I don't have time to describe all the new things in the patch, but here is what my git-log command tells me (Edit: I should add that I will describe the changes properly later; I just don't have the time right now):

  • Ported to latest trunk
  • Revert unnecessary changes to languageidentifier.
  • Java 5 compatibility fixes.
  • Don't use Override tags for "implement"ed methods.
  • Ugly hack for stringifying boolean queries.

Adds a static stringify(BooleanQuery) method to SolrSearchBean. We can't just call
BooleanQuery.toString() because it can produce strings like url:http://www.google.com
which is not a valid solr query string. Method stringify produces them like
url:"http://www.google.com".

  • Update segment names periodically in FetchedSegments.

Added a new monitoring thread to FetchedSegments that watch
for changes in the given segments directory and updates map
'segments' accordingly. Map 'segments' is changed to a ConcurrentMap
for thread-safety.

  • SolrResponseHandler updates.
  • Don't buffer characters in SolrResponseHandler unless necessary.
  • Use StringBuilder.setLength(0) instead of creating a new StringBuilder.
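
The quoting step from the "stringifying boolean queries" entry above can be sketched like this. It is not the patch's stringify(BooleanQuery) method: the hypothetical `quoteClause` helper works on a single field/value pair, and the backslash escaping of embedded quotes and backslashes is an assumption about what the query parser expects inside a quoted phrase.

```java
final class SolrQuoting {
  /**
   * Renders a field/value pair as field:"value". Double quotes and
   * backslashes inside the value are escaped with a backslash.
   */
  static String quoteClause(String field, String value) {
    StringBuilder sb = new StringBuilder(field.length() + value.length() + 3);
    sb.append(field).append(':').append('"');
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      if (c == '"' || c == '\\') {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.append('"').toString();
  }
}
```

For example, `SolrQuoting.quoteClause("url", "http://www.google.com")` yields `url:"http://www.google.com"`, so the colon inside the URL is no longer read as a field separator.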

Doğacan Güney made changes - 19/Nov/07 09:12 PM
Attachment NUTCH-442_v4.patch [ 12369823 ]
Tomislav Poljak added a comment - 28/Nov/07 10:13 AM
Hi Doğacan,
is it safe to apply this patch to Nutch/Hadoop running on multiple machines (Nutch master/slave with >1 nodes)?

Doğacan Güney added a comment - 29/Nov/07 08:05 AM
Tomislav, I am not sure what you mean.. I wouldn't yet put this patch in a production environment because it may blow up your computers, etc.. but this patch is designed to work in any configuration. So if yours is a development environment then this patch should work fine in your configuration.

Otis Gospodnetic added a comment - 02/Dec/07 09:36 PM
Doğacan – can you please explain what you mean by "blow up your computers"?

I didn't look at your latest patch yet, but I thought the Nutch-side of the patch was read-only...no?


Doğacan Güney added a comment - 02/Dec/07 10:59 PM
"blow up your computers" may be a bit of exaggeration

What I mean is that I have tested this patch myself, but it is a big and intrusive one so it can mess with your Solr index. I am not sure what you mean by "read-only" but a large portion of this patch deals with indexing to Solr. So it is not read-only in that sense.


Otis Gospodnetic added a comment - 03/Dec/07 06:44 AM
Doğacan - ah, good!

The Nutch side of the functionality included in this patch is "read-only". The Solr side is the only side where writes happen. I'm a lot happier to hear that the blowing up is really just messing with Solr/Lucene indices - that is a lot easier to deal with than something going wrong on the Nutch/Hadoop side, at least in my book.

Thanks for the clarification!


Otis Gospodnetic added a comment - 14/Apr/08 08:22 PM
This issue has a lot of votes and a lot of watchers. I'm ready to try this out with a good-sized Nutch+Solr installation and so will be able to see if anything really blows up or if this patch is good to commit.

The patch currently fails. Doğacan, do you happen to have an up to date version?


Doğacan Güney added a comment - 15/Apr/08 04:50 PM
Hi,

Sorry I have been out of the loop for a long time.

I am working on bringing the patch up to date with trunk. I should have one ready in a few days.


Doğacan Güney added a comment - 17/Apr/08 02:14 PM
Here is a patch that should apply against latest trunk. I tested it with a small crawl and it seems to work.

There are two small issues: unit tests do not compile yet (TestDistributedSearch changed in an incompatible way; I will fix this in a follow-up patch), and I had to pull the patch that reloads the search-servers.txt file if it is changed (I think it can be applied on top of this patch, so again a follow-up patch). Still, this should be useful enough to review and test.


Doğacan Güney made changes - 17/Apr/08 02:14 PM
Attachment NUTCH-442_v5.patch [ 12380395 ]
Doğacan Güney added a comment - 27/Apr/08 09:39 AM
Hey everyone,

I guess I shouldn't be saying this since I have been gone such a long time, but can I get some reviews/testing/suggestions on this patch? I know it is huge (sorry about that), but I would really love to see this one committed before 1.0.


Caspar MacRae added a comment - 12/May/08 08:19 PM
Hi,

Tried with nutch svn rev: 650750 and solr svn rev: 652571 - has been working perfectly for around a week in dev env

Attaching a trivial change for http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java

just adds the -solr argument.

thanks,
Caspar


Caspar MacRae made changes - 12/May/08 08:19 PM
Attachment Crawl.patch [ 12381906 ]
Caspar MacRae added a comment - 14/May/08 02:04 AM

Forgot to add; Indexer wasn't cleaning up tmp working directory (also appended random number to tmp dir name following style of CrawlDbMerger etc)

Caspar MacRae made changes - 14/May/08 02:04 AM
Attachment Indexer.patch [ 12382008 ]
Doğacan Güney added a comment - 18/May/08 01:06 PM
Hi Caspar,

Thanks for the patches and testing! I will merge your changes in a new patch

For future reference, can you send your patches as diffs (preferably against latest patch) instead of sending whole file?


Vladimir Garvardt added a comment - 21/Jun/08 01:32 PM
Hello.

I'm trying to apply this patch and faced a problem that I cannot solve by myself.

I checked out nutch trunk (rev 670194), downloaded attachments from this issue and started patching.
First I applied Crawl.patch, then Indexer.patch and then NUTCH-442_v5.patch. When applying the last patch I got a warning message. This happened because of a conflict between Crawl.patch and NUTCH-442_v5.patch.

Crawl.patch performs the following action:
// index, dedup & merge
+ indexer.index(indexes, solrUrl, crawlDb, linkDb,
+ Arrays.asList(fs.listPaths(segments, HadoopFSUtil.getPassAllFilter())));

and NUTCH-442_v5.patch performs the following action
// index, dedup & merge
- indexer.index(indexes, crawlDb, linkDb, fs.listPaths(segments, HadoopFSUtil.getPassAllFilter()));
+ indexer.index(indexes, null, crawlDb, linkDb,
+ Arrays.asList(fs.listPaths(segments, HadoopFSUtil.getPassAllFilter())));

The main difference between these patches is the second parameter.
First I tried to build nutch with the second parameter set to null: crawling finished successfully, but no data was added to solr.
Then I changed the second parameter to solrUrl and rebuilt nutch. During indexing the following exception was thrown and indexing failed (no data in solr):
Indexer: starting
Indexer: crawldb: crawl/crawldb
Indexer: linkdb: crawl/linkdb
Indexer: solrUrl: http://localhost:8984/solr/
Indexer: adding segment: file:/home/vladimirga/Documents/dev/src/lucene-src/nutch-2008-06-21/wrk-01/crawl/segments/20080621200352
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:318)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:148)

What can cause that problem and how can I fix it to make nutch index into solr?

Thanks.


Julien Nioche added a comment - 22/Jun/08 03:24 PM
Merged patches for Indexer and Crawler with 442_v5

Julien Nioche made changes - 22/Jun/08 03:24 PM
Attachment NUTCH-442_v6.patch.txt [ 12384452 ]
James Tan made changes - 30/Jul/08 07:51 AM
Comment [ I am facing the same issue that Vladimir Garvardt got above. Please see below. I basically check out the latest nutch version((Revision 680683) from http://svn.apache.org/repos/asf/lucene/nutch/trunk/ then apply only patch442_v6.patch. Do I need to apply any of the earlier patches with the latest nutch version(Revision 680683). Can anybody please advise on this? Thanks in advance!

.....
Indexer: starting
Indexer: crawldb: crawl.test/crawldb
Indexer: linkdb: crawl.test/linkdb
Indexer: solrUrl: http://localhost:8983/solr/
Indexer: adding segment: file:/nutch-solr/nutch-trunk/crawl.test/segments/20080729183600
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:319)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:148) ]
James Tan made changes - 30/Jul/08 07:52 AM
Comment [ Please disregard comment below. I am able to get it to work now. Thanks ]
Guillaume Smet added a comment - 05/Aug/08 07:57 AM
Here is an updated patch synced with SVN trunk.

A few thoughts and comments:

  • in Indexer.java, I added a try{} catch {} finally {} around JobClient.runJob(job); so that we're sure to cleanup the temp directory
  • why did you remove the try {} catch {} in the *IndexingFilter classes?
  • otherwise, it seems to work as expected and post the documents to Solr (be sure to add a string field called boost in your Solr schema).
  • from what I've seen, it doesn't deal with removing the documents from the index when gone
  • perhaps using Solrj to communicate with Solr instead of a custom class would be better
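
The cleanup pattern from the first point (run the job, then remove the temp directory whether or not the job failed) might look roughly like the sketch below. It does not use the Hadoop API: `TempDirCleanup` and its `Job` interface are hypothetical stand-ins, and plain `java.nio.file` replaces Hadoop's FileSystem calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

final class TempDirCleanup {
  /** Hypothetical stand-in for the indexing job; it receives the temp directory. */
  interface Job {
    void run(Path tmpDir) throws IOException;
  }

  /** Runs the job in a fresh temp directory and deletes that directory even if the job throws. */
  static Path runWithTempDir(Job job) {
    Path tmp;
    try {
      // createTempDirectory appends a random suffix to the given prefix.
      tmp = Files.createTempDirectory("index-tmp-");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    try {
      job.run(tmp);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      deleteRecursively(tmp); // runs on success and on failure
    }
    return tmp;
  }

  private static void deleteRecursively(Path root) {
    try (Stream<Path> paths = Files.walk(root)) {
      // Reverse order visits children before their parent directory.
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.deleteIfExists(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```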

Thanks for your work.


Guillaume Smet made changes - 05/Aug/08 07:57 AM
Attachment NUTCH-442_v7.patch.txt [ 12387544 ]
Nick Tkach added a comment - 19/Sep/08 03:53 PM
Took the liberty of converting the absolute filenames to relative names so you can apply the patch from the Nutch base install directory. Otherwise no changes.

Nick Tkach made changes - 19/Sep/08 03:53 PM
Attachment NUTCH-442_v7a.patch.txt [ 12390519 ]
Dmitry Grinberg added a comment - 07/Oct/08 05:53 AM
What needs to be done for this to make it into the trunk? I mean, any additional development or testing? Maybe I can be of use.

Doğacan Güney added a comment - 07/Oct/08 08:55 AM
Thanks to everyone for comments. Unfortunately this patch will probably have to wait until after 1.0 to get in.

But since many people are interested in having some sort of Solr integration in trunk maybe we can update Sami Siren's solr patch and commit it for 1.0.

What do others think?


Enis Soztutar added a comment - 07/Oct/08 02:00 PM
I personally believe this patch should be in before 1.0, since it does not make sense to make such a change in 1.1. However, since there is some need to test this patch more thoroughly, I guess we can make a branch and commit it there, so that people can test this easily. However, branching has its own problems; in particular, keeping in sync with trunk would get harder and harder.

Since this issue has a large number of votes and watchers, I suggest we branch and commit it, test this out a little bit more, and merge to trunk before 1.0.


Andrzej Bialecki added a comment - 07/Oct/08 02:38 PM
+1 on adding this before 1.0 - it would be a shame to miss this functionality when it's been asked for over and over. One change that should be made (either in this patch or as a follow-up) is to use SolrJ instead of plain HTTP.

I don't think we need to branch for this - as long as the patch passes tests and runs basic commands IMHO it's good enough to expose a wider audience to it. Applying this to trunk/ actually gives us better chances that it will be tested by more people.


Doğacan Güney added a comment - 07/Oct/08 04:49 PM
Great!

(I am obviously +1 on adding this before 1.0 )

So, can I get some reviews on what people think of this patch then?

On solrj: I will send an updated patch that uses solrj instead.


Doğacan Güney made changes - 09/Oct/08 11:56 AM
Assignee Doğacan Güney [ dogacan ]
Doğacan Güney made changes - 09/Oct/08 11:57 AM
Component/s searcher [ 11593 ]
Component/s indexer [ 11592 ]
Fix Version/s 1.0.0 [ 12312443 ]
Doğacan Güney added a comment - 09/Oct/08 12:00 PM
Latest patch.
  • Uses solrj, so you now need apache-solr-solrj and apache-solr-common under lib/
  • I deleted TestDistributedSearch for now since that test case is no longer applicable
  • I also separated lucene and solr indexers. I added two classes, IndexerMapReduce and IndexerOutputFormat, that deal with the bulk of indexing. So now if you want to write a new indexer
    you just write a new indexer class (examples: Indexer.java and SolrIndexer.java).
  • Also added a NutchIndexWriterFactory to generate NutchIndexWriter-s. I am not sure if this is the right approach.
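
The general shape of "shared indexing code plus one writer per backend" can be sketched as below. The names here (`DocWriter`, `InMemoryWriter`, `FanOutIndexer`) are hypothetical and deliberately differ from the patch's classes; the actual NutchIndexWriter interface may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical writer abstraction: one implementation per indexing backend. */
interface DocWriter {
  void write(Map<String, String> doc);
  void close();
}

/** Toy backend that only remembers what it was given. */
final class InMemoryWriter implements DocWriter {
  final List<Map<String, String>> docs = new ArrayList<>();
  boolean closed;

  public void write(Map<String, String> doc) {
    docs.add(doc);
  }

  public void close() {
    closed = true;
  }
}

final class FanOutIndexer {
  /** Sends every document to every configured writer, then closes all writers. */
  static void index(List<Map<String, String>> docs, List<? extends DocWriter> writers) {
    try {
      for (Map<String, String> doc : docs) {
        for (DocWriter writer : writers) {
          writer.write(doc);
        }
      }
    } finally {
      for (DocWriter writer : writers) {
        writer.close(); // close even if a write failed
      }
    }
  }
}
```

With a split like this, adding a backend means adding one `DocWriter` implementation; the loop that drives the writers stays untouched.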

Doğacan Güney made changes - 09/Oct/08 12:00 PM
Attachment NUTCH-442_v8.patch [ 12391810 ]
Julien Nioche added a comment - 09/Oct/08 03:59 PM
Hi Dogacan,

I have tested your patch and did not have any problems with it. Any chance it could get included in the trunk soon?

Thanks! this is an awesome improvement for Nutch

Julien


Felix Z. added a comment - 10/Oct/08 01:28 AM
Hi,

really great! I use the NUTCH/SOLR - integration patches for a theme-specific search engine and I could never reach this goal without the excellent work. But since the latest patch (v.8), I am not able to put the fetched documents into solr. The build is OK, apache-solr-solrj and apache-solr-common are in /lib, so everything should be fine so far. Do I have to configure solrj or to add something like "nutch index -solr http:...." like in patch v.7?

I got this error:
Indexer: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/opt/nutch/crawls/0744/3/18/crawlData/crawldb already exists

Thanks for help
Felix.


Doğacan Güney added a comment - 10/Oct/08 06:45 AM
Hi Felix,

I have changed the latest patch a bit, that is why you are getting an error

Specifically, I have separated solr and lucene indexing mechanisms so now you should run:

bin/nutch org.apache.nutch.indexer.solr.SolrIndexer <solr url> <crawldb> <linkdb> <segment> .....

Of course, before committing we can add a short hand to nutch script so you can run:

bin/nutch indexsolr ....


Felix Z. added a comment - 31/Oct/08 01:46 PM - edited
Hi everybody,

1. in SOLR, the field "cache" is empty. If the NUTCH/SOLR-Integration does not provide this, how can I put the full (and not parsed) html content into my SOLR-Database? I use patch v.8.

2. Is it possible to index a single character, especially "§" (the paragraph sign), with SOLR? Is it only a SOLR thing, or do I additionally have to change something in the NUTCH parser?

Thanks for help
Felix.


Julien Nioche added a comment - 13/Nov/08 09:36 AM
In the class SOLRIndexer.java should we specify :

job.setReduceSpeculativeExecution(false);

around the line 48?

otherwise we might have several attempts for a reduce task running at the same time and sending the same documents to the same SOLR instance which is likely to slow down the indexing. SpeculativeExecution does not make the indexing safer as these attempts would all crash in the same way if they receive a SOLRException.


Tony Wang added a comment - 10/Jan/09 03:39 AM
Hello everyone!

I am trying to integrate Nutch with Solr by applying the NUTCH-442_v8.patch file, but the patching process has not been very successful. See below:

The text below shown in red is my input on the SSH client window:

I've just downloaded NUTCH-442_v8.patch from https://issues.apache.org/jira/browse/NUTCH-442, but the patching process gave me lots of errors, see below:

webby88 /opt/tomcat6/webapps/nutch: patch < NUTCH-442_v8.patch (Is this right to apply patches in Linux CentOS 5.2?)

The next patch would delete the file TestDistributedSearch.java,
which does not exist! Assume -R? [n]
Apply anyway? [n] y (I chose yes)
can't find file to patch at input line 5
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--------------------------

Index: src/test/org/apache/nutch/searcher/TestDistributedSearch.java
===================================================================
--- src/test/org/apache/nutch/searcher/TestDistributedSearch.java (revision 701044)
+++ src/test/org/apache/nutch/searcher/TestDistributedSearch.java (working copy)
--------------------------
File to patch:
Skip this patch? [y] n
File to patch: src/test/org/apache/nutch/searcher/TestDistributedSearch.java (I copied the path from the revision 701044 to here)
patching file src/test/org/apache/nutch/searcher/TestDistributedSearch.java
can't find file to patch at input line 154
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--------------------------
Index: src/test/org/apache/nutch/indexer/TestIndexingFilters.java
===================================================================
--- src/test/org/apache/nutch/indexer/TestIndexingFilters.java (revision 701044)
+++ src/test/org/apache/nutch/indexer/TestIndexingFilters.java (working copy)
--------------------------
File to patch: src/test/org/apache/nutch/indexer/TestIndexingFilters.java (I copied the path from the revision 701044 to here)

Too many similar 'file cannot be found' errors here, so errors cut off.

File to patch:
Skip this patch? [y] y
Skipping patch.
11 out of 11 hunks ignored
patching file build.xml

When I tried to run 'ant war' in the nutch installation directory, I got this error:

BUILD FAILED
/opt/tomcat6/webapps/nutch/build.xml:107: Compile failed; see the compiler error output for details.

I wonder whether my way of applying this patch is correct. Could you please correct me if I did something wrong? My system is CentOS 5.2, by the way.


Aaron Hammond added a comment - 10/Jan/09 03:46 AM
I will be out of the office until January 26th.

Repository Revision Date User Message
ASF #733738 Mon Jan 12 13:26:16 UTC 2009 dogacan NUTCH-442 - Integrate Solr/Nutch
Files Changed
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/NutchBean.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneSearchBean.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/SolrSearchBean.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/FetchedSegments.java
MODIFY /lucene/nutch/trunk/src/plugin/response-xml/src/java/org/apache/nutch/searcher/response/xml/XMLResponseWriter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/net/URLFilters.java
ADD /lucene/nutch/trunk/lib/apache-solr-common-1.3.0.jar
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilter.java
MODIFY /lucene/nutch/trunk/src/web/jsp/anchors.jsp
ADD /lucene/nutch/trunk/lib/apache-solr-solrj-1.3.0.jar
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrConstants.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilters.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilters.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/Summary.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/RPCSegmentBean.java
MODIFY /lucene/nutch/trunk/src/web/jsp/cached.jsp
MODIFY /lucene/nutch/trunk/build.xml
MODIFY /lucene/nutch/trunk/src/plugin/feed/src/test/org/apache/nutch/parse/feed/TestFeedParser.java
MODIFY /lucene/nutch/trunk/src/plugin/tld/src/java/org/apache/nutch/scoring/tld/TLDScoringFilter.java
MODIFY /lucene/nutch/trunk/src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
MODIFY /lucene/nutch/trunk/src/web/jsp/search.jsp
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/RPCSearchBean.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/Indexer.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearchBean.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Inlinks.java
MODIFY /lucene/nutch/trunk/src/web/jsp/explain.jsp
MODIFY /lucene/nutch/trunk/CHANGES.txt
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
MODIFY /lucene/nutch/trunk/bin/nutch
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/QueryException.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/lucene
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/NutchDocument.java
MODIFY /lucene/nutch/trunk/src/plugin/build.xml
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/Hits.java
MODIFY /lucene/nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/indexer/TestIndexingFilters.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/lucene/LuceneWriter.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/SearchBean.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
MODIFY /lucene/nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/QueryFilters.java
MODIFY /lucene/nutch/trunk/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag/RelTagIndexingFilter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/OpenSearchServlet.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/Hit.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/lucene/LuceneConstants.java
MODIFY /lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/NGramProfile.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/IndexSearcher.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/servlet/Cached.java
MODIFY /lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/LanguageIdentifier.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java
MODIFY /lucene/nutch/trunk/src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java
MODIFY /lucene/nutch/trunk/src/plugin/tld/src/java/org/apache/nutch/indexer/tld/TLDIndexingFilter.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/SegmentBean.java
MODIFY /lucene/nutch/trunk/src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/LinkAnalysisScoringFilter.java
MODIFY /lucene/nutch/trunk/src/plugin/response-json/src/java/org/apache/nutch/searcher/response/json/JSONResponseWriter.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
MODIFY /lucene/nutch/trunk/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/HitDetails.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
MODIFY /lucene/nutch/trunk/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilter.java
MODIFY /lucene/nutch/trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java
MODIFY /lucene/nutch/trunk/src/plugin/languageidentifier/src/test/org/apache/nutch/analysis/lang/TestNGramProfile.java
MODIFY /lucene/nutch/trunk/src/plugin/creativecommons/src/java/org/creativecommons/nutch/CCIndexingFilter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/Query.java
MODIFY /lucene/nutch/trunk/src/plugin/languageidentifier/src/test/org/apache/nutch/analysis/lang/TestLanguageIdentifier.java
MODIFY /lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java

Doğacan Güney added a comment - 12/Jan/09 01:26 PM
Committed in rev. 733738.

Besides syncing with trunk, I added a "solrindex" command to bin/nutch.

Also, since we no longer have an indexDocNo, response-xml and response-json now return "indexkey" instead. For Lucene indexes, this should be functionally the same as indexDocNo.


Doğacan Güney made changes - 12/Jan/09 01:27 PM
Resolution: Fixed [ 1 ]
Status: Open [ 1 ] → Resolved [ 5 ]
Repository Revision Date User Message
ASF #733744 Mon Jan 12 13:30:28 UTC 2009 dogacan Unrelated change went in accidentally in NUTCH-442. Reverting to old version.
Files Changed
MODIFY /lucene/nutch/trunk/src/plugin/build.xml

Doğacan Güney added a comment - 12/Jan/09 05:23 PM
To Julien Nioche:

I missed your comment about speculative execution. I think it is a good idea and I will add it as a separate commit.


Doğacan Güney added a comment - 12/Jan/09 05:30 PM
Aaand one more thing:

I just deleted TestDistributedSearch for now, as distributed search changed too much. For example, there is no DistributedSearch.Client anymore. Later, we can figure out how to add it back.


Repository Revision Date User Message
ASF #733848 Mon Jan 12 17:33:16 UTC 2009 dogacan Two more NUTCH-442 changes:

* Delete TestDistributedSearch for now
* Set reduceSpeculativeExecution false for SolrIndexer
Files Changed
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
DEL /lucene/nutch/trunk/src/test/org/apache/nutch/searcher/TestDistributedSearch.java
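
As a rough illustration of why speculative execution matters for a reducer that writes to an external server, here is a minimal sketch that does not use Hadoop or Solr at all. With speculation enabled, Hadoop may start a second attempt of the same reduce task, and anything that attempt already sent to an external service is not rolled back when the attempt is discarded. The class `SpeculativeIndexingSketch`, the `FakeIndexServer` stand-in, and `runReduceTask` are hypothetical names made up for this example, and the sketch models the worst case where the duplicate attempt runs to completion. In the old `mapred` API the corresponding switch is `JobConf.setReduceSpeculativeExecution(false)`; the exact call used in SolrIndexer is not shown in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpeculativeIndexingSketch {

    /** Hypothetical stand-in for a remote index server; it only records the add requests it receives. */
    public static class FakeIndexServer {
        public final List<String> received = new ArrayList<String>();

        public void add(String docId) {
            received.add(docId);
        }
    }

    /**
     * Simulates one reduce task that pushes every document to an external server.
     * With speculation on, a duplicate attempt processes the same input
     * (worst case assumed here: the duplicate attempt runs to completion).
     * Returns the total number of add requests the server received.
     */
    public static int runReduceTask(List<String> docIds, boolean speculative, FakeIndexServer server) {
        int attempts = speculative ? 2 : 1;
        for (int attempt = 0; attempt < attempts; attempt++) {
            for (String id : docIds) {
                // External side effect: not undone if this attempt is later discarded.
                server.add(id);
            }
        }
        return server.received.size();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("http://a/", "http://b/", "http://c/");
        System.out.println("speculation off: "
                + runReduceTask(docs, false, new FakeIndexServer()) + " add requests");
        System.out.println("speculation on:  "
                + runReduceTask(docs, true, new FakeIndexServer()) + " add requests");
    }
}
```

In this simplified model the server sees twice as many add requests when a duplicate attempt runs, which is the kind of redundant load that turning reduce-side speculation off avoids.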

Hudson added a comment - 13/Jan/09 04:16 AM
Integrated in Nutch-trunk #691 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/691/)
Two more NUTCH-442 changes:
  • Delete TestDistributedSearch for now
  • Set reduceSpeculativeExecution false for SolrIndexer
Unrelated change went in accidentally in NUTCH-442. Reverting to old version.
Integrate Solr/Nutch

Sami Siren added a comment - 27/Mar/09 08:28 PM
closing issues for released version

Sami Siren made changes - 27/Mar/09 08:28 PM
Status: Resolved [ 5 ] → Closed [ 6 ]