[SOLR-1301] Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.7, 6.0
Component/s: None
Labels:
None

Description

This patch contains a contrib module that provides distributed indexing for Solr, using Hadoop and EmbeddedSolrServer. The idea behind this module is twofold:

- provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
- avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

Design
----------

Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer and an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. Each document is added to a batch, which is periodically submitted to the EmbeddedSolrServer. When the reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
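The write path described above can be illustrated with a minimal, self-contained sketch. It does not use the patch's classes or the Solr API: `Converter`, `FakeServer`, and `Writer` are hypothetical stand-ins for SolrDocumentConverter, EmbeddedSolrServer, and SolrRecordWriter, and the size-based flush is an assumption about what "periodically submitted" means.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of these types come from the patch or from Solr.
class BatchingWriterSketch {

    /** Turns one (key, value) pair into a document, modelled here as a field map. */
    interface Converter<K, V> {
        Map<String, Object> convert(K key, V value);
    }

    /** Stand-in for the embedded server: it only records what it was asked to do. */
    static class FakeServer {
        final List<Map<String, Object>> indexed = new ArrayList<>();
        final List<String> calls = new ArrayList<>();

        void add(List<Map<String, Object>> docs) {
            indexed.addAll(docs);
            calls.add("add:" + docs.size());
        }

        void commit() { calls.add("commit"); }

        void optimize() { calls.add("optimize"); }
    }

    /** Buffers converted documents and submits them in batches (assumed: by size). */
    static class Writer<K, V> {
        private final Converter<K, V> converter;
        private final FakeServer server;
        private final int batchSize;
        private final List<Map<String, Object>> batch = new ArrayList<>();

        Writer(Converter<K, V> converter, FakeServer server, int batchSize) {
            this.converter = converter;
            this.server = server;
            this.batchSize = batchSize;
        }

        void write(K key, V value) {
            batch.add(converter.convert(key, value));
            if (batch.size() >= batchSize) {
                flush();
            }
        }

        private void flush() {
            if (batch.isEmpty()) {
                return;
            }
            server.add(new ArrayList<>(batch)); // hand over a copy, then reuse the buffer
            batch.clear();
        }

        /** Called once the reduce task is done: flush the remainder, then commit and optimize. */
        void close() {
            flush();
            server.commit();
            server.optimize();
        }
    }

    /** Writes three records with a batch size of two and returns the recording server. */
    static FakeServer demo() {
        FakeServer server = new FakeServer();
        Converter<String, String> converter = (k, v) -> {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("id", k);
            doc.put("text", v);
            return doc;
        };
        Writer<String, String> writer = new Writer<>(converter, server, 2);
        writer.write("1", "first");
        writer.write("2", "second");
        writer.write("3", "third");
        writer.close();
        return server;
    }

    public static void main(String[] args) {
        System.out.println(demo().calls); // prints [add:2, add:1, commit, optimize]
    }
}
```

In this sketch the final partial batch is flushed in close() before the commit, so no buffered document is lost when the task ends.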

The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.

This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). These part-NNNNN directories can be used to run N shard servers. Users can also specify the number of reduce tasks; with a single reduce task, the output consists of a single shard.
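As a rough illustration of the resulting layout, the sketch below lists the shard directories for a given number of reduce tasks. It assumes NNNNN is the zero-padded, zero-based reducer index, as in Hadoop's usual part-00000 naming; the output path `/user/demo/solr-out` and the class itself are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the patch.
class ShardDirsSketch {

    /** Directory name for one reducer, assuming a five-digit zero-padded index. */
    static String partDir(int reducerIndex) {
        return String.format(Locale.ROOT, "part-%05d", reducerIndex);
    }

    /** One shard directory per reduce task under the job's output directory. */
    static List<String> shardDirs(String outputDir, int numReducers) {
        List<String> dirs = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            dirs.add(outputDir + "/" + partDir(i));
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Three reducers give three shards; one reducer gives a single shard.
        for (String dir : shardDirs("/user/demo/solr-out", 3)) {
            System.out.println(dir);
        }
    }
}
```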

An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead.

This patch relies on hadoop-core-0.19.1.jar (attached to this issue as hadoop-0.19.1-core.jar). Put the jar in contrib/hadoop/lib.

Note: the development of this patch was sponsored by an anonymous contributor and approved for release under the Apache License.

Attachments

File | Uploaded | Size | Uploader
SolrRecordWriter.java | 13/Sep/09 01:43 | 10 kB | Jason Venner (www.prohadoop.com)
SOLR-1301-maven-intellij.patch | 27/Nov/13 00:27 | 39 kB | Steven Rowe
SOLR-1301-hadoop-0-20.patch | 27/Apr/10 23:53 | 57 kB | Alexander Kanarsky
SOLR-1301-hadoop-0-20.patch | 28/Apr/10 01:23 | 57 kB | Matt Revelle
SOLR-1301.patch | 10/Sep/09 17:30 | 33 kB | Jason Rutherglen
SOLR-1301.patch | 10/Sep/09 18:46 | 34 kB | Jason Rutherglen
SOLR-1301.patch | 11/Sep/09 18:23 | 34 kB | Kris Jirapinyo
SOLR-1301.patch | 24/Sep/09 03:25 | 60 kB | Jason Rutherglen
SOLR-1301.patch | 19/Oct/09 23:19 | 61 kB | Jason Rutherglen
SOLR-1301.patch | 01/Feb/10 03:41 | 61 kB | Jason Rutherglen
SOLR-1301.patch | 02/Feb/10 16:57 | 61 kB | Jason Rutherglen
SOLR-1301.patch | 20/Sep/10 08:40 | 64 kB | Alexander Kanarsky
SOLR-1301.patch | 18/Oct/10 22:28 | 64 kB | Alexander Kanarsky
SOLR-1301.patch | 22/Feb/12 23:19 | 58 kB | Alexander Kanarsky
SOLR-1301.patch | 25/Mar/12 00:17 | 58 kB | Greg Bowyer
SOLR-1301.patch | 31/Aug/13 21:32 | 963 kB | Mark Miller
SOLR-1301.patch | 10/Sep/13 23:39 | 2.33 MB | Mark Miller
SOLR-1301.patch | 13/Sep/13 05:33 | 2.44 MB | Mark Miller
SOLR-1301.patch | 14/Sep/13 21:14 | 2.47 MB | Mark Miller
SOLR-1301.patch | 16/Sep/13 21:15 | 2.49 MB | Mark Miller
SOLR-1301.patch | 14/Oct/13 01:15 | 4.59 MB | Mark Miller
SOLR-1301.patch | 22/Nov/13 05:52 | 4.59 MB | Mark Miller
README.txt | 13/Sep/09 01:44 | 3 kB | Jason Venner (www.prohadoop.com)
log4j-1.2.15.jar | 24/Sep/09 17:44 | 383 kB | Jason Rutherglen
hadoop-core-0.20.2-cdh3u3.jar | 22/Feb/12 23:24 | 3.43 MB | Alexander Kanarsky
hadoop-0.20.1-core.jar | 06/May/10 00:35 | 2.56 MB | Jason Rutherglen
hadoop-0.19.1-core.jar | 22/Jul/09 11:28 | 2.27 MB | Andrzej Bialecki
hadoop.patch | 22/Jul/09 11:28 | 28 kB | Andrzej Bialecki
commons-logging-api-1.0.4.jar | 24/Sep/09 03:25 | 26 kB | Jason Rutherglen
commons-logging-1.0.4.jar | 24/Sep/09 03:25 | 37 kB | Jason Rutherglen

Issue Links

is duplicated by

SOLR-1045 Build Solr index using Hadoop MapReduce (Closed)

is related to

SOLR-5667 Performance problem when not using hdfs block cache. (Closed)
SOLR-6212 upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected under java 8/9 with 9.5.1-4 (Closed)
SOLR-1457 Deploy shards from HDFS into local cores (Closed)
SOLR-5758 need ref guide doc on building indexes with mapreduce (morphlines-cell contrib) (Closed)

relates to

SOLR-1045 Build Solr index using Hadoop MapReduce (Closed)