Nutch / NUTCH-760

Allow field mapping from nutch to solr index

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1
    • Component/s: indexer
    • Labels: None
    • Patch Info: Patch Available

      Description

      I am using Nutch to crawl sites and have combined it with Solr,
      pushing the Nutch index using the solrindex command. I have set it up
      as specified on the wiki, using the copyField from url to id in the
      schema. Whilst this works fine, it messes up my inputs from other
      sources in Solr (e.g. using the Solr data import handler), as they have
      both ids and urls. I have a patch that implements a Nutch XML schema
      defining what the basic Nutch fields map to in your Solr push.
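      To illustrate the idea of mapping basic Nutch fields to differently named Solr fields, here is a minimal sketch. It is not the code from the attached patch: the class name FieldMappingSketch, the destination names page_url and page_title, and the use of plain Java maps in place of Nutch and Solr document types are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: renames Nutch field names to Solr field names. */
final class FieldMappingSketch {
  // Source (Nutch) field name -> destination (Solr) field name.
  private final Map<String, String> sourceToDest;

  FieldMappingSketch(Map<String, String> sourceToDest) {
    this.sourceToDest = new LinkedHashMap<>(sourceToDest);
  }

  /** Returns the mapped Solr name; unmapped fields keep their name. */
  String mapKey(String nutchField) {
    return sourceToDest.getOrDefault(nutchField, nutchField);
  }

  /** Builds a new document keyed by the mapped Solr field names. */
  Map<String, Object> apply(Map<String, Object> nutchDoc) {
    Map<String, Object> solrDoc = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : nutchDoc.entrySet()) {
      solrDoc.put(mapKey(e.getKey()), e.getValue());
    }
    return solrDoc;
  }

  public static void main(String[] args) {
    // Assumed mapping: keep Nutch's url and title out of fields that
    // other Solr input sources already use.
    Map<String, String> mapping = new LinkedHashMap<>();
    mapping.put("url", "page_url");
    mapping.put("title", "page_title");

    Map<String, Object> nutchDoc = new LinkedHashMap<>();
    nutchDoc.put("url", "http://example.com/");
    nutchDoc.put("title", "Example");
    nutchDoc.put("content", "hello");

    System.out.println(new FieldMappingSketch(mapping).apply(nutchDoc));
  }
}
```

      With a mapping like this, documents pushed from Nutch would no longer need the copyField from url to id, so they would not collide with records from other sources that carry their own id and url values.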

      1. solrindex_schema.patch
        4 kB
        David Stuart
      2. solrindex_schema.patch
        5 kB
        David Stuart
      3. solrindex_schema.patch
        12 kB
        David Stuart
      4. solrindex_schema.patch
        12 kB
        David Stuart

        Activity

        David Stuart added a comment -

        First pass at a schema reader for mapping basic nutch fields to solr

        David Stuart added a comment -

        oops left out schema file

        Andrzej Bialecki added a comment -

        Thanks David, this is a good start. We also need to address the searching part, i.e. SolrSearchBean, where Nutch hardcodes the same field names.

        David Stuart added a comment -

        Updated patch with the modifications to the SolrSearchBean. Have also refactored a wee bit to allow other classes to hook into the Solr index schema.

        Andrzej Bialecki added a comment -

        A few comments to the latest patch:

        • the description of the property in nutch-default.xml could be more descriptive
        • <schema> element has name and version attributes - do we really need these? It's not a Solr schema.xml anyway, so we don't have to pretend that we follow the same format.
        • SolrSchemaReader uses static instance of NutchConfiguration - this is a big no-no, the whole point of using the property in nutch-default.xml is that you could set different values, and making this field static basically pins down the configuration to the version set on the first instantiation of the class ... Please do as other similar classes do - implement Configurable, or add Configuration to the constructor, and pass the current job configuration where appropriate.
        • consequently, static references to SolrSchemaReader need to be un-staticized in other places.
        • minor nits: code formatting should use 2 literal spaces indents. There are some accidental changes in NutchBean and SolrWriter.
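
        The point about static configuration can be sketched as follows. This is an illustration only, not the patch's code: java.util.Properties stands in for Hadoop's Configuration so the example runs without dependencies, and the class name MappingReaderSketch and the property key example.mapping.file are hypothetical.

```java
import java.util.Properties;

/**
 * Hypothetical sketch: a reader that receives its configuration through
 * the constructor instead of holding it in a static field.
 * java.util.Properties is a stand-in for Hadoop's Configuration.
 */
final class MappingReaderSketch {
  // Hypothetical property key, used only for this illustration.
  static final String MAPPING_FILE_KEY = "example.mapping.file";

  // Instance field: each reader keeps the configuration it was given.
  private final Properties conf;

  MappingReaderSketch(Properties conf) {
    this.conf = conf;
  }

  /** Returns the configured mapping file, or a default if unset. */
  String mappingFile() {
    return conf.getProperty(MAPPING_FILE_KEY, "default-mapping.xml");
  }

  public static void main(String[] args) {
    Properties jobA = new Properties();
    jobA.setProperty(MAPPING_FILE_KEY, "mapping-a.xml");

    Properties jobB = new Properties();
    jobB.setProperty(MAPPING_FILE_KEY, "mapping-b.xml");

    // Each instance reflects the job configuration passed to it.
    System.out.println(new MappingReaderSketch(jobA).mappingFile());
    System.out.println(new MappingReaderSketch(jobB).mappingFile());
  }
}
```

        Had the configuration been a static field initialised on first use, both readers would report whichever value was loaded first, which is the pinning problem described in the comment above.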
        David Stuart added a comment -

        Thanks,

        I will have another go. It's quite a big task getting my head around all of the
        ins and outs of Nutch, but it's good to contribute to a great product.

        Regards,

        Dave

        David Stuart added a comment -

        Have updated patch as per comment below

        • the description of the property in nutch-default.xml could be more descriptive
        • <schema> element has name and version attributes - do we really need these? It's not a Solr schema.xml anyway, so we don't have to pretend that we follow the same format.
        • SolrSchemaReader uses static instance of NutchConfiguration - this is a big no-no, the whole point of using the property in nutch-default.xml is that you could set different values, and making this field static basically pins down the configuration to the version set on the first instantiation of the class ... Please do as other similar classes do - implement Configurable, or add Configuration to the constructor, and pass the current job configuration where appropriate.
        • consequently, static references to SolrSchemaReader need to be un-staticized in other places.
        • minor nits: code formatting should use 2 literal spaces indents. There are some accidental changes in NutchBean and SolrWriter.
        David Stuart added a comment -

        Hi Andrzej,

        I have amended the patch to incorporate your suggestions
        https://issues.apache.org/jira/browse/NUTCH-760

        Regards,

        Dave

        Andrzej Bialecki added a comment -

        I reworked the patch to get rid of any left-overs of static Configuration, and changed the concept of "schema" (which was misleading) to "mapping" throughout the patch and class names.

        This is now committed in rev. 884269 - thanks!

        Hudson added a comment -

        Integrated in Nutch-trunk #995 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/)
        Add part of .
        Allow field mapping from nutch to solr index.


          People

          • Assignee: Andrzej Bialecki
          • Reporter: David Stuart
          • Votes: 0
          • Watchers: 2
