Solr / SOLR-1301

Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: None
    • Labels: None

      Description

      This patch contains a contrib module that provides distributed indexing for Solr, using Hadoop and EmbeddedSolrServer. The idea behind this module is twofold:

      • provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
      • avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

      Design
      ----------

      Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer and an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. Each document is then added to a batch, which is periodically submitted to the EmbeddedSolrServer. When the reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
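
The write path described above (convert, batch, submit, then commit and optimize on close) can be sketched in plain Java without Hadoop or Solr on the classpath. This is only an illustration of the pattern, not the code in the patch: DocumentConverter, IndexSink, BatchingRecordWriter and RecordingSink are hypothetical stand-ins for SolrDocumentConverter, EmbeddedSolrServer and SolrRecordWriter, documents are modelled as plain maps, and the batch size parameter is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for SolrDocumentConverter: maps one (key, value) pair to documents. */
interface DocumentConverter<K, V> {
    List<Map<String, Object>> convert(K key, V value);
}

/** Hypothetical stand-in for the embedded server: accepts batches, then commit and optimize. */
interface IndexSink {
    void add(List<Map<String, Object>> batch);
    void commit();
    void optimize();
}

/** Buffers converted documents and submits them to the sink in batches. */
class BatchingRecordWriter<K, V> {
    private final DocumentConverter<K, V> converter;
    private final IndexSink sink;
    private final int batchSize; // assumed tuning knob, not taken from the patch
    private final List<Map<String, Object>> batch = new ArrayList<>();

    BatchingRecordWriter(DocumentConverter<K, V> converter, IndexSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.converter = converter;
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Called once per reduce output pair. */
    void write(K key, V value) {
        batch.addAll(converter.convert(key, value));
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    /** Called when the reduce task finishes: submit the remainder, then commit and optimize. */
    void close() {
        flush();
        sink.commit();
        sink.optimize();
    }

    private void flush() {
        if (batch.isEmpty()) {
            return;
        }
        sink.add(new ArrayList<>(batch)); // hand over a copy so the buffer can be reused
        batch.clear();
    }
}

/** In-memory sink that records the calls it receives, so the flow can be inspected. */
class RecordingSink implements IndexSink {
    final List<String> calls = new ArrayList<>();
    int docs = 0;

    public void add(List<Map<String, Object>> batch) {
        docs += batch.size();
        calls.add("add:" + batch.size());
    }

    public void commit() {
        calls.add("commit");
    }

    public void optimize() {
        calls.add("optimize");
    }
}
```

With a batch size of 2 and three written pairs, the sink in this sketch sees one batch of two documents, then one batch of one document on close, followed by commit and optimize.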

      The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.

      This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.
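
As a rough illustration of the resulting layout, the sketch below names one output directory per reduce task and shows which shard a given key would land in. It rests on two assumptions not stated in the description: that the NNNNN suffix is the reducer index zero-padded to five digits, and that the job uses the same formula as Hadoop's default HashPartitioner. ShardLayout is a hypothetical helper, not part of the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that mirrors the part-NNNNN shard layout described above. */
final class ShardLayout {
    private ShardLayout() {}

    /** Directory name for one reducer's shard; five-digit zero padding is an assumption. */
    static String shardDirName(int reducerIndex) {
        if (reducerIndex < 0) {
            throw new IllegalArgumentException("reducerIndex must be >= 0");
        }
        return String.format(Locale.ROOT, "part-%05d", reducerIndex);
    }

    /** One shard directory per reduce task. */
    static List<String> shardDirNames(int numReduceTasks) {
        if (numReduceTasks < 1) {
            throw new IllegalArgumentException("numReduceTasks must be >= 1");
        }
        List<String> names = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            names.add(shardDirName(i));
        }
        return names;
    }

    /** Shard index for a key, using the formula of Hadoop's default HashPartitioner. */
    static int shardFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

With a single reduce task, shardFor always returns 0, which matches the single-shard case mentioned above.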

      An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead.
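
To give a feel for what a lightweight CSV converter might look like, here is a minimal sketch that splits a line on commas and pairs the values with field names. It is not the example application from the patch: CsvLineConverter is a hypothetical class, the field names are supplied by the caller, and the parser assumes plain comma-separated values with no quoting or escaping.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical converter from one CSV line to a flat field-name/value document. */
final class CsvLineConverter {
    private final String[] fieldNames;

    CsvLineConverter(String... fieldNames) {
        this.fieldNames = fieldNames.clone();
    }

    /** Splits on commas without regular expressions; quoting is deliberately not handled. */
    static List<String> split(String line) {
        List<String> out = new ArrayList<>();
        int start = 0;
        int comma;
        while ((comma = line.indexOf(',', start)) >= 0) {
            out.add(line.substring(start, comma));
            start = comma + 1;
        }
        out.add(line.substring(start)); // last field, possibly empty
        return out;
    }

    /** Pairs each value with its field name, preserving column order. */
    Map<String, String> toDocument(String line) {
        List<String> values = split(line);
        if (values.size() != fieldNames.length) {
            throw new IllegalArgumentException(
                "expected " + fieldNames.length + " fields but got " + values.size());
        }
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            doc.put(fieldNames[i], values.get(i));
        }
        return doc;
    }
}
```

A converter built as new CsvLineConverter("id", "text") would turn the line 1,hello into a document with id=1 and text=hello; a real job would then copy those fields into a SolrInputDocument.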

      This patch relies on hadoop-core-0.19.1.jar. I attached the jar to this issue; you should put it in contrib/hadoop/lib.

      Note: the development of this patch was sponsored by an anonymous contributor and approved for release under the Apache License.

        Attachments

        1. commons-logging-1.0.4.jar
          37 kB
          Jason Rutherglen
        2. commons-logging-api-1.0.4.jar
          26 kB
          Jason Rutherglen
        3. hadoop.patch
          28 kB
          Andrzej Bialecki
        4. hadoop-0.19.1-core.jar
          2.27 MB
          Andrzej Bialecki
        5. hadoop-0.20.1-core.jar
          2.56 MB
          Jason Rutherglen
        6. hadoop-core-0.20.2-cdh3u3.jar
          3.43 MB
          Alexander Kanarsky
        7. log4j-1.2.15.jar
          383 kB
          Jason Rutherglen
        8. README.txt
          3 kB
          Jason Venner (www.prohadoop.com)
        9. SOLR-1301.patch
          4.59 MB
          Mark Miller
        10. SOLR-1301.patch
          4.59 MB
          Mark Miller
        11. SOLR-1301.patch
          2.49 MB
          Mark Miller
        12. SOLR-1301.patch
          2.47 MB
          Mark Miller
        13. SOLR-1301.patch
          2.44 MB
          Mark Miller
        14. SOLR-1301.patch
          2.33 MB
          Mark Miller
        15. SOLR-1301.patch
          963 kB
          Mark Miller
        16. SOLR-1301.patch
          58 kB
          Greg Bowyer
        17. SOLR-1301.patch
          58 kB
          Alexander Kanarsky
        18. SOLR-1301.patch
          64 kB
          Alexander Kanarsky
        19. SOLR-1301.patch
          64 kB
          Alexander Kanarsky
        20. SOLR-1301.patch
          61 kB
          Jason Rutherglen
        21. SOLR-1301.patch
          61 kB
          Jason Rutherglen
        22. SOLR-1301.patch
          61 kB
          Jason Rutherglen
        23. SOLR-1301.patch
          60 kB
          Jason Rutherglen
        24. SOLR-1301.patch
          34 kB
          Kris Jirapinyo
        25. SOLR-1301.patch
          34 kB
          Jason Rutherglen
        26. SOLR-1301.patch
          33 kB
          Jason Rutherglen
        27. SOLR-1301-hadoop-0-20.patch
          57 kB
          Matt Revelle
        28. SOLR-1301-hadoop-0-20.patch
          57 kB
          Alexander Kanarsky
        29. SOLR-1301-maven-intellij.patch
          39 kB
          Steve Rowe
        30. SolrRecordWriter.java
          10 kB
          Jason Venner (www.prohadoop.com)

              People

              • Assignee: Mark Miller (markrmiller@gmail.com)
              • Reporter: Andrzej Bialecki (ab)
              • Votes: 30
              • Watchers: 55