Nutch / NUTCH-1024

Dynamically set fetchInterval by MIME-type

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6
    • Component/s: generator
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Add a facility to configure default or fixed fetchInterval values by MIME-type. This helps conserve resources: files known to change frequently, files that never change, and everything in between can each get a suitable interval.

      • simple key\tvalue\n configuration file
      • only set fetchInterval for new documents
      • keep max fetchInterval fixed by current config
      1. NUTCH-1024-1.5-3.patch
        17 kB
        Markus Jelsma
      2. NUTCH-1024-1.5-2.patch
        15 kB
        Markus Jelsma
      3. NUTCH-1024-1.5-1.patch
        14 kB
        Markus Jelsma
      4. Nutch.patch
        0.5 kB
        Markus Jelsma
      5. MimeAdaptiveFetchSchedule.java
        8 kB
        Markus Jelsma
      6. adaptive-mimetypes.txt
        0.1 kB
        Markus Jelsma
      7. AdaptiveFetchSchedule.patch
        0.5 kB
        Markus Jelsma

          Activity

          Markus Jelsma added a comment -

          MIME-type information is required in the CrawlDatum for this to work. NUTCH-779 adds an option to specify ParseMeta keys to be added to the CrawlDatum.

          Markus Jelsma added a comment -

          Here's a first WIP. It extends AdaptiveFetchSchedule and changes the INC/DEC rates depending on the current MIME-type. It also patches AdaptiveFetchSchedule so that the INC and DEC properties are protected and settable from the child class. I also added two properties to metadata.Nutch for reading the Content-Type key as Writable from the CrawlDatum MetaData. That was a bit of trickery.

          It uses the original INC and DEC rate values for CrawlDatum entries without a Content-Type in their MetaData or with an unconfigured Content-Type.
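
          To make the fallback concrete, here is a minimal Java sketch. It is not the code from the attached patch. It assumes the usual adaptive rule (the interval shrinks by the DEC rate when a page changed and grows by the INC rate when it did not); the class name, the default rates and the per-MIME values are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-MIME INC/DEC rates with a fallback to defaults. */
class MimeRateFallbackSketch {
  // Assumed global defaults, standing in for the configured adaptive rates.
  static final double DEFAULT_INC_RATE = 0.4;
  static final double DEFAULT_DEC_RATE = 0.2;

  // Per-MIME rates stored as {inc, dec}; the values are made up for illustration.
  static final Map<String, double[]> RATES = new HashMap<>();
  static {
    RATES.put("text/html", new double[] {0.2, 0.5});
  }

  /**
   * Returns the next fetch interval. A missing or unconfigured MIME-type
   * falls back to the default rates.
   */
  static double nextInterval(double interval, String mimeType, boolean modified) {
    double[] r = (mimeType == null) ? null : RATES.get(mimeType);
    double inc = (r != null) ? r[0] : DEFAULT_INC_RATE;
    double dec = (r != null) ? r[1] : DEFAULT_DEC_RATE;
    // Changed pages are refetched sooner, unchanged pages later.
    return modified ? interval * (1.0 - dec) : interval * (1.0 + inc);
  }

  public static void main(String[] args) {
    System.out.println(nextInterval(1000.0, "text/html", true));  // configured type
    System.out.println(nextInterval(1000.0, "image/png", true));  // unconfigured type
    System.out.println(nextInterval(1000.0, null, false));        // no Content-Type
  }
}
```

          A real schedule would also clamp the result between the configured minimum and maximum intervals, which this sketch leaves out.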

          Please comment. There must be something wrong as it seems to work.

          Markus Jelsma added a comment -

          New version with proper handling of the Content-Type attribute. In my test I didn't include the charset, which is present in real tests.
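
          For readers wondering what the charset issue looks like: a Content-Type value such as "text/html; charset=UTF-8" has to be reduced to its bare MIME-type before it can be used as a lookup key. The Java sketch below shows one way to do that. It is an illustration only, not necessarily how the patch handles it, and the class and method names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical helper that reduces a Content-Type value to its bare MIME-type. */
class ContentTypeSketch {
  /**
   * Strips parameters such as "; charset=UTF-8" and normalizes case.
   * Returns null when no usable MIME-type is present.
   */
  static String baseMimeType(String contentType) {
    if (contentType == null) {
      return null;
    }
    int semi = contentType.indexOf(';');
    String base = (semi >= 0) ? contentType.substring(0, semi) : contentType;
    base = base.trim().toLowerCase(Locale.ROOT);
    return base.isEmpty() ? null : base;
  }

  public static void main(String[] args) {
    System.out.println(baseMimeType("text/html; charset=UTF-8"));
    System.out.println(baseMimeType("application/pdf"));
  }
}
```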

          Markus Jelsma added a comment -

          New version that allows for separate inc and dec rate values per MIME-type. Conf file format is now: mime\tinc_rate\tdec_rate. Code uses internal struct for storing rates per mime in a hashmap.

          Please comment.
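
          The file format and the "struct in a hashmap" idea can be sketched in Java as follows. This is not the code from the attached MimeAdaptiveFetchSchedule.java. The class names, the sample rates, and the choice to skip blank lines, '#' comments and malformed lines are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical parser for lines of the form mime\tinc_rate\tdec_rate. */
class MimeRateFileSketch {
  /** Small holder for the two rates of one MIME-type. */
  static final class Rates {
    final float inc;
    final float dec;

    Rates(float inc, float dec) {
      this.inc = inc;
      this.dec = dec;
    }
  }

  /** Parses the whole file content into a map keyed by MIME-type. */
  static Map<String, Rates> parse(String content) {
    Map<String, Rates> rates = new HashMap<>();
    for (String line : content.split("\\r?\\n")) {
      // Assumption: blank lines and lines starting with '#' are ignored.
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue;
      }
      String[] cols = line.split("\t");
      if (cols.length != 3) {
        continue; // malformed line: expected exactly three columns
      }
      try {
        float inc = Float.parseFloat(cols[1].trim());
        float dec = Float.parseFloat(cols[2].trim());
        rates.put(cols[0].trim(), new Rates(inc, dec));
      } catch (NumberFormatException e) {
        // Unparseable rate: skip the line rather than fail the whole file.
      }
    }
    return rates;
  }

  public static void main(String[] args) {
    String sample = "# mime\tinc_rate\tdec_rate\n"
        + "text/html\t0.2\t0.1\n"
        + "application/pdf\t0.1\t0.5\n";
    Map<String, Rates> rates = parse(sample);
    Rates pdf = rates.get("application/pdf");
    System.out.println("application/pdf inc=" + pdf.inc + " dec=" + pdf.dec);
  }
}
```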

          Markus Jelsma added a comment -

          I'd like to commit this issue this Friday unless there are objections or other comments.

          Julien Nioche added a comment -

          Do you mind if we wait a bit? I'd like to spend some time on it first and see how this would fit with the refresh info we would get from the sitemap entries.

          Markus Jelsma added a comment -

          Sure, but what do you mean by info from sitemap entries? Is there an issue to point to?

          Julien Nioche added a comment -

          There is a JIRA issue for 2.0, https://issues.apache.org/jira/browse/NUTCH-882, but I'd like to do it in 1.4.

          We've talked about processing sitemaps on the mailing lists for some time and now have crawler-commons to help us with the parsing. Entries in sitemaps have some info about how frequently they are likely to be modified so it is somewhat related to this issue.

          Markus Jelsma added a comment -

          Integration with sitemaps and crawler-commons is something that's not being implemented now. Should we include this in 1.4? It does offer good flexibility on large crawls with semi-immutable MIME-types.

          Markus Jelsma added a comment - - edited

          New patch for trunk! This also includes a change to the injector where injected fetchInterval is added to CrawlDatum MD. In AdaptiveFetchSchedule this injected interval overrides anything else. This is useful for sites where you want to use AdaptiveFetchSchedule but still want the generator to select an injected homepage every N hours.
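
          The override described above can be pictured with this minimal Java sketch. It uses a plain map in place of the CrawlDatum metadata, and the key name is hypothetical because the key used by the patch is not stated in this thread. The point is only the precedence: an injected fixed interval, when present, wins over whatever the adaptive calculation produced.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: an injected fixed interval overrides the adaptive one. */
class FixedIntervalSketch {
  // Hypothetical metadata key; not taken from the patch.
  static final String FIXED_INTERVAL_KEY = "fixedFetchInterval";

  /**
   * Returns the injected interval if the metadata carries a valid one,
   * otherwise the interval computed by the adaptive schedule.
   */
  static float chooseInterval(Map<String, String> metadata, float adaptiveInterval) {
    String fixed = metadata.get(FIXED_INTERVAL_KEY);
    if (fixed == null) {
      return adaptiveInterval;
    }
    try {
      return Float.parseFloat(fixed.trim());
    } catch (NumberFormatException e) {
      return adaptiveInterval; // ignore an unparseable injected value
    }
  }

  public static void main(String[] args) {
    Map<String, String> homepage = new HashMap<>();
    homepage.put(FIXED_INTERVAL_KEY, "3600"); // e.g. refetch every hour
    System.out.println(chooseInterval(homepage, 86400f));
    System.out.println(chooseInterval(new HashMap<>(), 86400f));
  }
}
```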

          Markus Jelsma added a comment -

          Thoughts? I'd like to send this one in.

          Julien Nioche added a comment -

          Hi Markus

          Will have a closer look later. 2 quick comments for now

          AdaptiveFetchSchedule => remove calls to System.out and use logging instead
          Metadata/Nutch => MIME_TYPE_KEY duplicates the one in Metadata/HttpHeaders

          Markus Jelsma added a comment -

          I'll change the legacy System.out calls to logging. HttpHeaders doesn't have Text representations of the strings, but I'll be happy to add them if you want.

          Markus Jelsma added a comment -

          New patch for 1.5 with modifications as per Julien's comments.

          Lewis John McGibbney added a comment -

          I like this, Markus. To be honest, I've not had time to give it a spin recently, so apologies for that. It looks like the process to date has been a bit frustrating, so I apologize for not chipping in earlier. Anyway, we don't rely on commons for logging; could you please replace this with

          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;
          

          Another further point from me:

          You make reference to the following configuration properties:

          SCHEDULE_INC_RATE = "db.fetch.schedule.adaptive.inc_rate";
          SCHEDULE_DEC_RATE = "db.fetch.schedule.adaptive.dec_rate";
          SCHEDULE_MIME_FILE = "db.fetch.schedule.mime.file";
          

          I don't see the new MIME_FILE property added to the patch, and I also don't see the INC and DEC properties added to nutch-default.xml.
          Thanks

          Markus Jelsma added a comment -

          I'll fix the logging, this is old code. The inc and dec rate directives are already in nutch-default but the mime-file and the file itself are missing.

          Markus Jelsma added a comment -

          New patch with proper logging and configuration files.

          Markus Jelsma added a comment -

          Something went wrong here.

          Markus Jelsma added a comment -

          20120304-push-1.6

          Markus Jelsma added a comment -

          I'll commit this one in the next few days unless there are objections or improvements. Thanks

          Markus Jelsma added a comment -

          Committed for 1.6 in rev. 1349226.
          Thanks!

          Hudson added a comment -

          Integrated in nutch-trunk-maven #310 (See https://builds.apache.org/job/nutch-trunk-maven/310/)
          NUTCH-1024 Dynamically set fetchInterval by MIME-type (Revision 1349226)

          Result = SUCCESS
          markus :
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/conf/adaptive-mimetypes.txt
          • /nutch/trunk/conf/nutch-default.xml
          • /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
          • /nutch/trunk/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java
          • /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
          • /nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java
          Hudson added a comment -

          Integrated in Nutch-trunk #1869 (See https://builds.apache.org/job/Nutch-trunk/1869/)
          NUTCH-1024 Dynamically set fetchInterval by MIME-type (Revision 1349226)

          Result = SUCCESS
          markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349226
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/conf/adaptive-mimetypes.txt
          • /nutch/trunk/conf/nutch-default.xml
          • /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
          • /nutch/trunk/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java
          • /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
          • /nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java

            People

            • Assignee: Markus Jelsma
            • Reporter: Markus Jelsma
            • Votes: 0
            • Watchers: 2
