SOLR-1301

Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, Trunk
    • Component/s: None
    • Labels: None

      Description

      This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr via EmbeddedSolrServer. The idea behind this module is twofold:

      • provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
      • avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.
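      To make the first point concrete, below is a minimal job-driver sketch (not the CSV example bundled with the patch) showing how SolrOutputFormat slots into an ordinary Hadoop 0.19-era (org.apache.hadoop.mapred) job. The SolrOutputFormat package name and the two configuration keys used to point at solr.home and the converter class are assumptions for illustration only; the patch may expose different names or helper methods.

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.KeyValueTextInputFormat;
      import org.apache.hadoop.mapred.lib.IdentityMapper;
      import org.apache.hadoop.mapred.lib.IdentityReducer;
      import org.apache.solr.hadoop.SolrOutputFormat; // assumed package path

      public class SolrIndexJob {
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(SolrIndexJob.class);
          conf.setJobName("solr-hadoop-indexing");

          // Input: tab-separated key/value text records on HDFS.
          conf.setInputFormat(KeyValueTextInputFormat.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));

          // Pass records straight through; a real job would do its own map/reduce work.
          conf.setMapperClass(IdentityMapper.class);
          conf.setReducerClass(IdentityReducer.class);
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(Text.class);

          // Reduce output goes to SolrOutputFormat instead of plain files;
          // each reducer builds one index shard under the output directory.
          conf.setOutputFormat(SolrOutputFormat.class);
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          conf.setNumReduceTasks(4); // 4 reducers -> 4 shards

          // Hypothetical configuration keys: the local solr.home whose conf/ and lib/
          // are used, and the SolrDocumentConverter implementation to apply.
          conf.set("solr.home", "/local/path/to/solr/home");
          conf.set("solr.document.converter.class", "com.example.TabSeparatedConverter");

          JobClient.runJob(conf);
        }
      }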

      Design
      ----------

      Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write the data. SolrRecordWriter instantiates an EmbeddedSolrServer as well as an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. Documents are added to a batch, which is periodically submitted to the EmbeddedSolrServer. When the reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
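      As an illustration of the converter side, here is a sketch of a SolrDocumentConverter implementation that maps one (key, value) pair to one SolrInputDocument. The converter's package, its exact generic signature, and whether it is an interface or an abstract base class are not spelled out above, so treat the shape below as assumed rather than as the patch's actual API.

      import java.util.Collection;
      import java.util.Collections;

      import org.apache.hadoop.io.Text;
      import org.apache.solr.common.SolrInputDocument;
      import org.apache.solr.hadoop.SolrDocumentConverter; // assumed package path and shape

      // Converts one Hadoop (key, value) record into one Solr document:
      // the key becomes the unique id, the value holds tab-separated fields.
      public class TabSeparatedConverter extends SolrDocumentConverter<Text, Text> {
        public Collection<SolrInputDocument> convert(Text key, Text value) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", key.toString());

          String[] cols = value.toString().split("\t", 2);
          doc.addField("title", cols[0]);
          if (cols.length > 1) {
            doc.addField("body", cols[1]);
          }
          // Returning a collection leaves room for converters that emit
          // several documents (or none) per input record.
          return Collections.singletonList(doc);
        }
      }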

      The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.

      This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Users can also control the number of reduce tasks; in particular, with a single reduce task the output consists of a single shard.
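      For example, a job configured with 4 reduce tasks would leave a layout along these lines in the output directory (the contents of each shard directory are illustrative: conf/ and lib/ come from the solr.home given to the job, and data/ holds the index written by the embedded server):

        output/
          part-00000/
            conf/
            lib/
            data/
          part-00001/
          part-00002/
          part-00003/

      Each part-NNNNN directory is then a self-contained Solr home that can be copied to a node and served as one shard.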

      An example application is provided that processes large CSV files using this API. It uses custom CSV processing to avoid (de)serialization overhead.

      This patch relies on hadoop-core-0.19.1.jar. I attached the jar to this issue; put it in contrib/hadoop/lib.

      Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.

      Attachments

      1. SOLR-1301-maven-intellij.patch (39 kB) - Steve Rowe
      2. SOLR-1301.patch (4.59 MB) - Mark Miller
      3. SOLR-1301.patch (4.59 MB) - Mark Miller
      4. SOLR-1301.patch (2.49 MB) - Mark Miller
      5. SOLR-1301.patch (2.47 MB) - Mark Miller
      6. SOLR-1301.patch (2.44 MB) - Mark Miller
      7. SOLR-1301.patch (2.33 MB) - Mark Miller
      8. SOLR-1301.patch (963 kB) - Mark Miller
      9. SOLR-1301.patch (58 kB) - Greg Bowyer
      10. hadoop-core-0.20.2-cdh3u3.jar (3.43 MB) - Alexander Kanarsky
      11. SOLR-1301.patch (58 kB) - Alexander Kanarsky
      12. SOLR-1301.patch (64 kB) - Alexander Kanarsky
      13. SOLR-1301.patch (64 kB) - Alexander Kanarsky
      14. hadoop-0.20.1-core.jar (2.56 MB) - Jason Rutherglen
      15. SOLR-1301-hadoop-0-20.patch (57 kB) - Matt Revelle
      16. SOLR-1301-hadoop-0-20.patch (57 kB) - Alexander Kanarsky
      17. SOLR-1301.patch (61 kB) - Jason Rutherglen
      18. SOLR-1301.patch (61 kB) - Jason Rutherglen
      19. SOLR-1301.patch (61 kB) - Jason Rutherglen
      20. log4j-1.2.15.jar (383 kB) - Jason Rutherglen
      21. commons-logging-api-1.0.4.jar (26 kB) - Jason Rutherglen
      22. commons-logging-1.0.4.jar (37 kB) - Jason Rutherglen
      23. SOLR-1301.patch (60 kB) - Jason Rutherglen
      24. README.txt (3 kB) - Jason Venner (www.prohadoop.com)
      25. SolrRecordWriter.java (10 kB) - Jason Venner (www.prohadoop.com)
      26. SOLR-1301.patch (34 kB) - Kris Jirapinyo
      27. SOLR-1301.patch (34 kB) - Jason Rutherglen
      28. SOLR-1301.patch (33 kB) - Jason Rutherglen
      29. hadoop-0.19.1-core.jar (2.27 MB) - Andrzej Bialecki
      30. hadoop.patch (28 kB) - Andrzej Bialecki


      People

      • Assignee: Mark Miller
      • Reporter: Andrzej Bialecki
      • Votes: 30
      • Watchers: 54
