SOLR-1045

Build Solr index using Hadoop MapReduce

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: None
    • Labels: None

      Description

      The goal is a contrib module that builds a Solr index using Hadoop MapReduce.

      It differs from the Solr support in Nutch, which sends each document to a Solr server from a reduce task. Here, the goal is to build or update the Solr index within the map/reduce tasks themselves. This also achieves better parallelism when the number of map tasks is greater than the number of reduce tasks, which is usually the case.
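The idea of building the index inside the tasks can be illustrated with a toy model. This is a minimal sketch in plain Python, not Hadoop or Solr code: `tokenize`, `build_shard`, and `split_input` are hypothetical names, and the "index" is a simple in-memory inverted index standing in for a real Lucene index. It only shows that each input split can be indexed independently of the others.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for a real analyzer."""
    return text.lower().split()


def build_shard(docs):
    """Toy 'map task': build an inverted index over one input split.

    docs is a list of (doc_id, text) pairs.
    Returns a dict mapping term -> sorted list of doc ids.
    """
    postings = defaultdict(set)
    for doc_id, text in docs:
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def split_input(docs, num_splits):
    """Partition documents round-robin into num_splits input splits."""
    return [docs[i::num_splits] for i in range(num_splits)]


docs = [
    (0, "Solr builds on Lucene"),
    (1, "Hadoop runs MapReduce jobs"),
    (2, "Lucene index merging"),
    (3, "MapReduce builds the index"),
]

# Each split is indexed on its own, so the indexing work is spread over
# as many tasks as there are splits (hypothetical assumption of this model).
shards = [build_shard(split) for split in split_input(docs, 2)]
```

In this model, no document is ever sent to a central server; each task produces its own small index, which would still have to be merged or served separately afterwards.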

          Activity

          Ning Li created issue -
          Ning Li made changes -
          Field Original Value New Value
          Attachment SOLR-1045.0.patch [ 12401278 ]
          Ning Li added a comment -

          The purpose of this simple initial version is to give people an idea of the functionality. It uses Hadoop contrib/index, which uses the Hadoop mapred package. Future versions will be very different from this version. The main difference is that in this version, after a Solr input document is converted to a Lucene document, a Lucene index writer is used to build the index. In future versions, a Solr writer/core will be used.

          Here are some pre-requisites for this issue:

          • Hadoop 0.20, which is yet to be released. Two features in 0.20 are important for this issue.
            First is the new mapreduce package. The flexibility of the new mapreduce API makes it possible to use a Solr writer/core in mapper tasks.
            Second is the upgrade to Jetty 6 (6.1.14). The current release, 0.19, uses Jetty 5.
          • A couple of changes are required in Solr.
            First is to make SolrCore support an indexing-only mode (i.e. no search). Only then is it feasible to use it for indexing in a map task.
            Second is to upgrade from Jetty 6.1.3 to Jetty 6.1.14. Hadoop 0.20 uses a feature that is not available in 6.1.3.

          What do you think about making "SolrCore support an indexing-only mode"?

          Noble Paul added a comment -

          First is to make SolrCore support an indexing-only mode (i.e. no search)

          Why is this a pre-requisite?

          Ning Li added a comment -

          If SolrCore supports an indexing-only mode, no resources will be spent on search, which the MapReduce job does not use. If you feel this is "good-to-have" instead of "must-have", then I think this is an important "good-to-have".

          Ning Li added a comment -

          Building a Solr index (the data directory) in a MapReduce job also means we should be able to:

          • write a Solr index in a ram directory
          • merge multiple Solr indexes into one Solr index

          Any objections if I open Jira issues on supporting these two features?
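The second requirement, merging several indexes into one, can be sketched with a toy in-memory model. This is plain Python and not the Lucene or Solr API: `merge_indexes` is a hypothetical helper, each "index" is a dict of postings held in memory, and the renumbering of shard-local document ids by a running offset is an assumption of this model.

```python
def merge_indexes(shards):
    """Merge toy indexes into one, renumbering document ids.

    Each shard is a (num_docs, postings) pair, where postings maps
    term -> sorted list of shard-local doc ids in range(num_docs).
    Returns (total_docs, merged_postings).
    """
    merged = {}
    base = 0  # doc id offset applied to the current shard
    for num_docs, postings in shards:
        for term, ids in postings.items():
            merged.setdefault(term, []).extend(base + i for i in ids)
        base += num_docs
    return base, merged


# Two small indexes, each with its own local doc ids starting at 0.
shard_a = (2, {"solr": [0], "index": [0, 1]})
shard_b = (3, {"index": [1], "hadoop": [0, 2]})

total_docs, merged = merge_indexes([shard_a, shard_b])
```

Because the offset only grows and each shard's posting lists are already sorted, the merged posting lists stay sorted without an extra sort step.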

          Shalin Shekhar Mangar added a comment -

          write a Solr index in a ram directory

          It is possible to use a RAMDirectory but I haven't tried. See SOLR-465 for details.

          merge multiple Solr indexes into one Solr index

          Please go ahead. Do you mean merging the indexes of two Solr cores? I have thought of exposing that as a CoreAdmin command.

          Yonik Seeley added a comment -

          merge multiple Solr indexes into one Solr index

          +1
          I think Solr should support multiple local indexes (call them fragments?) per "index" and be able to perform operations such as merging.
          I mentioned this here a while ago too:
          http://www.lucidimagination.com/search/document/de518893396af002/solr2_onward_and_upward

          Ning Li added a comment -

          Shalin and Yonik, thanks for the comments on the two features. But what is a Solr index? I thought it is everything in the data directory, not just the Lucene index in the data/index directory, no? If that's the case:

          • On writing a Solr index in a ram directory, I'm aware of the directory factory, but it only covers the directory of the Lucene index.
          • On merging multiple Solr indexes, besides merging the Lucene indexes, it also means somehow "merging" other data in the data directory (e.g. "merging" by rebuilding the spell check index).

          Am I correct?

          Yonik Seeley added a comment -

          The "solr index" is normally just a Lucene index which has been indexed according to the particular schema.
          There are exceptions, as you have noted:

          • the spell check index
          • ExternalFileField

          It's worth keeping these in mind, and it could perhaps be useful to handle them at some point, but that certainly doesn't seem critical.
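One way to read the "merging by rebuilding" idea for auxiliary data such as a spell check index is to derive it again from the merged main index rather than merging it directly. The following is a hedged sketch in plain Python, not Solr's spell checker: `rebuild_spell_dictionary` and `suggest` are hypothetical names, the postings dict is a toy stand-in for a merged index, and the fuzzy matching uses the standard library's `difflib`.

```python
import difflib


def rebuild_spell_dictionary(postings, min_doc_freq=1):
    """Derive a spell-check vocabulary from a (merged) inverted index.

    Keeps only terms that occur in at least min_doc_freq documents.
    """
    return sorted(
        term for term, ids in postings.items() if len(ids) >= min_doc_freq
    )


def suggest(word, dictionary, limit=3):
    """Return up to `limit` dictionary terms that are close to `word`."""
    return difflib.get_close_matches(word, dictionary, n=limit)


# Toy merged index: term -> doc ids (illustrative data only).
merged_postings = {
    "lucene": [0, 2],
    "index": [2, 3],
    "hadoop": [1],
    "solr": [0],
}

# Rebuilt from scratch after the merge instead of merging old dictionaries.
dictionary = rebuild_spell_dictionary(merged_postings, min_doc_freq=2)
```

A call such as `suggest("lucen", dictionary)` would then look up candidates in the rebuilt vocabulary only.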

          Ning Li made changes -
          Link This issue is blocked by SOLR-1051 [ SOLR-1051 ]
          Grant Ingersoll made changes -
          Link This issue is related to SOLR-1301 [ SOLR-1301 ]
          Alex Baranau added a comment -

          write a Solr index in a ram directory

          Please, take a look at https://issues.apache.org/jira/browse/SOLR-1379 - RAMDirectoryFactory

          Lance Norskog added a comment -

          Map/Reduce would also be useful in the DataImportHandler. We're talking about parallelizing analysis stacks that require a lot of CPU. I would rather push this sort of thing out into the DIH - Solr Cell, for example. The DIH declaration language could have something like the ANT parallelization directives.

          At this level of multi-threaded sophistication, Solr really wants to be an OSGi application instead of a custom-built mini application server.

          Shalin Shekhar Mangar made changes -
          Fix Version/s 1.5 [ 12313566 ]
          Kevin Peterson added a comment -

          Can anyone using this code comment on how this relates to SOLR-1301?

          https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828915#action_12828915

          These seem to have identical goals but very different approaches.

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria were "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Hoss Man made changes -
          Fix Version/s Next [ 12315093 ]
          Fix Version/s 1.5 [ 12313566 ]
          Hoss Man made changes -
          Fix Version/s 3.2 [ 12316172 ]
          Fix Version/s Next [ 12315093 ]
          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir made changes -
          Fix Version/s 3.3 [ 12316471 ]
          Fix Version/s 3.2 [ 12316172 ]
          Robert Muir made changes -
          Fix Version/s 3.4 [ 12316683 ]
          Fix Version/s 4.0 [ 12314992 ]
          Fix Version/s 3.3 [ 12316471 ]
          Robert Muir added a comment -

          3.4 -> 3.5

          Robert Muir made changes -
          Fix Version/s 3.5 [ 12317876 ]
          Fix Version/s 3.4 [ 12316683 ]
          Simon Willnauer made changes -
          Fix Version/s 3.6 [ 12319065 ]
          Fix Version/s 3.5 [ 12317876 ]
          Hoss Man added a comment -

          Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

          email notification suppressed to prevent mass-spam
          pseudo-unique token identifying these issues: hoss20120321nofix36

          Hoss Man made changes -
          Fix Version/s 3.6 [ 12319065 ]
          Robert Muir made changes -
          Fix Version/s 4.1 [ 12321141 ]
          Fix Version/s 4.0 [ 12314992 ]
          Mark Miller made changes -
          Fix Version/s 4.2 [ 12323893 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.1 [ 12321141 ]
          Robert Muir made changes -
          Fix Version/s 4.3 [ 12324128 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.2 [ 12323893 ]
          Uwe Schindler made changes -
          Fix Version/s 4.4 [ 12324324 ]
          Fix Version/s 4.3 [ 12324128 ]
          Furkan KAMACI added a comment -

          Has there been any progress on this issue? If not, I can work on it.

          Otis Gospodnetic added a comment -

          Has there been any progress on this issue? If not, I can work on it.

          Please go for it! See also SOLR-1301

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Steve Rowe made changes -
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.5 [ 12324743 ]
          Fix Version/s 4.4 [ 12324324 ]
          Adrien Grand made changes -
          Fix Version/s 4.6 [ 12325000 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.5 [ 12324743 ]
          Uwe Schindler made changes -
          Fix Version/s 4.7 [ 12325573 ]
          Fix Version/s 4.6 [ 12325000 ]
          David Smiley made changes -
          Fix Version/s 4.8 [ 12326254 ]
          Fix Version/s 4.7 [ 12325573 ]
          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Uwe Schindler made changes -
          Fix Version/s 4.9 [ 12326731 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.8 [ 12326254 ]

            People

            • Assignee: Unassigned
            • Reporter: Ning Li
            • Votes: 13
            • Watchers: 30
