Nutch / NUTCH-1294

IndexClean job with solr implementation.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: nutchgora
    • Fix Version/s: 2.3
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      I started by copying and altering the trunk version of SolrClean, though it was inadequate for our needs: we needed to mark particular pages as gone even though they might still be visible on the web. This implementation abstracts the index cleaning process, provides a Solr implementation, and adds a clean index plugin extension point that lets others tailor how pages are removed from their store.
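      As a rough illustration of the idea described above (pluggable filters that each decide whether a page should be dropped from the index), here is a minimal, self-contained sketch. It is not the code from the patch: `PageRecord`, `CleaningFilter`, `GoneStatusFilter`, `RetiredPrefixFilter`, and `shouldRemove` are hypothetical names invented for this example. The assumption that the first filter voting for removal wins is also only illustrative.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: none of these types are the Nutch API.
public class IndexCleanSketch {

    /** Minimal stand-in for a stored page (hypothetical). */
    static final class PageRecord {
        final String url;
        final int httpStatus;

        PageRecord(String url, int httpStatus) {
            this.url = url;
            this.httpStatus = httpStatus;
        }
    }

    /** Hypothetical extension point: a plugin votes on whether a page leaves the index. */
    interface CleaningFilter {
        boolean remove(PageRecord page);
    }

    /** Votes to remove pages whose last fetch returned 404 or 410. */
    static final class GoneStatusFilter implements CleaningFilter {
        @Override
        public boolean remove(PageRecord page) {
            return page.httpStatus == 404 || page.httpStatus == 410;
        }
    }

    /** Votes to remove pages under a given URL prefix, even if they are still live. */
    static final class RetiredPrefixFilter implements CleaningFilter {
        private final String prefix;

        RetiredPrefixFilter(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public boolean remove(PageRecord page) {
            return page.url.startsWith(prefix);
        }
    }

    /** Runs the filters in order; the page is removed as soon as one filter says so. */
    static boolean shouldRemove(List<CleaningFilter> filters, PageRecord page) {
        for (CleaningFilter filter : filters) {
            if (filter.remove(page)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<CleaningFilter> filters = Arrays.asList(
                new GoneStatusFilter(),
                new RetiredPrefixFilter("http://example.com/old/"));

        PageRecord[] pages = {
                new PageRecord("http://example.com/old/a", 200),
                new PageRecord("http://example.com/new/b", 200),
                new PageRecord("http://example.com/new/c", 404),
        };
        for (PageRecord page : pages) {
            System.out.println(page.url + " remove=" + shouldRemove(filters, page));
        }
    }
}
```

      The second filter covers the case mentioned in the description: a page that still returns 200 on the web can be marked for removal by site-specific logic, without touching the job that talks to the index backend.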

      1. NUTCH-1294-v3.patch
        18 kB
        Claudiu Chis
      2. NUTCH-1294-v2.patch
        18 kB
        Lewis John McGibbney
      3. NUTCH-1294.patch
        17 kB
        Dan Rosher

        Activity

        Lewis John McGibbney added a comment -

        I think this is a really neat patch. The new extension point is a great addition to this often-desired aspect of maintaining your index. The script in bin/nutch needs to be updated with the correct command, and the patch needs to be tested before we commit. I would be happy to get this tested once the blocker NUTCH-1205 has been resolved (which looks to be very soon). It would be great to get this into 2.0. Thanks Dan.

        Lewis John McGibbney added a comment -

        New patch which makes trivial accommodations for the associated class(es) in conf/log4j.properties and adds the relevant CLI configuration to bin/nutch.

        Lewis John McGibbney added a comment -

        Meant to say, I'm still testing this out, but ended up identifying some peculiarities in the gora-cassandra backend when browsing through some debug logs ;0)

        Generally speaking I think this looks OK, but it would be great if others could provide some comments if and when you guys get around to it.

        Lewis John McGibbney added a comment -

        Still not tested thoroughly enough, so setting and classifying for 2.1.

        Claudiu Chis added a comment -
        • no changes to java files
        • added logging for IndexCleanerJob
        • the patch now fully deploys (in v2 src/bin/nutch and conf/log4j.properties had to be applied manually)
        Lewis John McGibbney added a comment -

        Thank you Claudiu.
        Does anyone have an issue committing this patch?
        I have verified it on my side and I am +1 for it.

        lufeng added a comment -

        Passed testing with Solr 4.2.1. +1 for commit.

        Lewis John McGibbney added a comment -

        Hi Feng, can you please commit?
        I won't be able to commit this code right now.
        Thanks for the review; it would be great if you could commit here.
        Best
        Lewis


        lufeng added a comment -

        Committed @revision 1513549 in 2.x HEAD

        Hudson added a comment -

        SUCCESS: Integrated in Nutch-nutchgora #717 (See https://builds.apache.org/job/Nutch-nutchgora/717/)
        NUTCH-1294 IndexClean job with solr implementation. (fenglu: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1513548)

        • /nutch/branches/2.x/CHANGES.txt
          NUTCH-1294 IndexClean job with solr implementation. (fenglu: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1513543)
        • /nutch/branches/2.x/conf/log4j.properties
        • /nutch/branches/2.x/src/bin/nutch
        • /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexCleanerJob.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexCleaningFilter.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexCleaningFilters.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrClean.java
        • /nutch/branches/2.x/src/plugin/nutch-extensionpoints/plugin.xml
        Lewis John McGibbney added a comment -

        Thanks Feng. Can you also do README.txt when you get a minute please?
        Thank you v much.

        lufeng added a comment -

        Hi Lewis. My pleasure. But what should I do for README.txt? Do you mean I should also change something in https://svn.apache.org/repos/asf/nutch/branches/2.x/README.txt?

        Lewis John McGibbney added a comment -

        Sorry, I meant CHANGES.txt.



          People

          • Assignee:
            Unassigned
            Reporter:
            Dan Rosher
          • Votes:
            0
            Watchers:
            6
