[NUTCH-1445] Add ElasticIndexerJob that indexes to elasticsearch - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.1
Component/s: None
Labels: None

Description

We have created a new indexer job, ElasticIndexerJob, that indexes to elasticsearch. It is originally based on https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2 license), but we have modified it heavily to make it integrate as well as possible into Nutch. The biggest modification is that documents are flushed to elasticsearch asynchronously, in bulk.
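The asynchronous bulk flush mentioned above can be sketched roughly as follows. This is a minimal illustration of the buffering pattern only, not the code from the attached patch: the class name `AsyncBulkBuffer`, the `maxBulkDocs` threshold and the `sender` callback (which stands in for an actual elasticsearch bulk request) are all hypothetical, and documents are plain strings to keep the example self-contained. It assumes a single caller thread, as in a reducer or record writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical helper: buffers documents and sends them in bulk on a background thread. */
class AsyncBulkBuffer implements AutoCloseable {
    private final int maxBulkDocs;                 // assumed threshold, not a Nutch property
    private final Consumer<List<String>> sender;   // stand-in for a real bulk request
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private List<String> pending = new ArrayList<>();
    private Future<?> inFlight;                    // at most one bulk is in flight at a time

    AsyncBulkBuffer(int maxBulkDocs, Consumer<List<String>> sender) {
        this.maxBulkDocs = maxBulkDocs;
        this.sender = sender;
    }

    /** Adds a document; starts a bulk flush once the buffer is full. */
    void add(String doc) {
        pending.add(doc);
        if (pending.size() >= maxBulkDocs) {
            flush();
        }
    }

    /** Waits for the previous bulk (if any), then hands the current buffer to the background thread. */
    void flush() {
        waitForInFlight();
        if (pending.isEmpty()) {
            return;
        }
        final List<String> batch = pending;
        pending = new ArrayList<>();               // caller keeps filling a fresh buffer
        inFlight = executor.submit(() -> sender.accept(batch));
    }

    private void waitForInFlight() {
        if (inFlight == null) {
            return;
        }
        try {
            inFlight.get();                        // surfaces failures of the previous bulk
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for bulk", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Bulk request failed", e.getCause());
        } finally {
            inFlight = null;
        }
    }

    /** Flushes the remaining documents and waits until they have been sent. */
    @Override
    public void close() {
        try {
            flush();
            waitForInFlight();
        } finally {
            executor.shutdown();
        }
    }
}
```

A caller would create the buffer once, call `add` for every document and `close` at the end of the job. Allowing only one bulk in flight keeps memory bounded while the caller continues to fill the next buffer; whether the patch makes the same trade-off is not stated here.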

Elasticsearch rocks. Both its performance and its ease of configuration are awesome. You simply deploy a server by unpacking the tar, configure the cluster name, start the server and fire away indexing requests. Indices are created automatically and fields are automapped. (Of course it is recommended to create your own optimized mapping, but that is beyond the scope of this issue.) Multiple servers connect without extra configuration, simply by using the same cluster name (by means of multicast). There are tons of advanced options, such as sharding, replication, disk striping etc.

To give an example of the performance: with 20+ nodes we are able to index over 1M docs (average-sized web documents) per minute. The best part is that the added documents are almost instantly searchable, so there are none of the hidden commit costs that Solr has. This is with the out-of-the-box configuration.

(I will attach a patch and commit it for Nutch2. Feel free to adapt it for trunk.)

Attachments

NUTCH-1445-addToNutchScript.patch
01/Aug/12 14:36
1.0 kB
Ferdy
NUTCH-1445-addPropsToConfig.patch
03/Aug/12 15:08
1.0 kB
Ferdy
NUTCH-1445.patch
01/Aug/12 14:22
13 kB
Ferdy

Issue Links

relates to

NUTCH-1462 Elasticsearch not indexing when type==null in NutchDocument metadata

Closed

Activity

People

Assignee: Unassigned

Reporter: Ferdy

Votes: 0

Watchers: 3

Dates

Created: 01/Aug/12 14:12

Updated: 31/Aug/12 12:41

Resolved: 03/Aug/12 15:08