Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14
    • Component/s: parser
    • Labels: None

      Description

      I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].

      [0] http://sourceforge.net/projects/sitemap-parser/
      [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

      1. NUTCH-1465.patch
        27 kB
        Markus Jelsma
      2. NUTCH-1465.patch
        27 kB
        Markus Jelsma
      3. NUTCH-1465.patch
        27 kB
        Markus Jelsma
      4. NUTCH-1465.patch
        27 kB
        Markus Jelsma
      5. NUTCH-1465-sitemapinjector-trunk-v1.patch
        17 kB
        Sebastian Nagel
      6. NUTCH-1465-trunk.v1.patch
        27 kB
        Tejas Patil
      7. NUTCH-1465-trunk.v2.patch
        16 kB
        Tejas Patil
      8. NUTCH-1465-trunk.v3.patch
        19 kB
        Tejas Patil
      9. NUTCH-1465-trunk.v4.patch
        19 kB
        Tejas Patil
      10. NUTCH-1465-trunk.v5.patch
        21 kB
        Tejas Patil

        Issue Links

          Activity

          kkrugler Ken Krugler added a comment -

          The sitemap parsing code referenced in the discussion you note has been placed in crawler-commons. We just finished using it during a crawl (fixed one bug, dealing with sitemaps that have a BOM) and it worked fine for the sites we were crawling.

          lewismc Lewis John McGibbney added a comment -

           I think I can envisage the next comment on this thread... this is yet another reason to use crawler commons :0)
           Ken, I wonder if you would be so kind as to start a thread over on dev@nutch regarding the atmosphere over @ CC... it was my thought that we were flogging a dead horse with this conversation, but the duplication of issues over here that are quite clearly included in CC seems rather ridiculous.

          kkrugler Ken Krugler added a comment -

           Hi Lewis - I could start a thread, but I also don't want to flog a dead horse.

          I'm spending occasional small amounts of time trying to move code from Bixo over to CC, and the plan is for the 0.9 release of Bixo to switch over to using CC where possible.

           But the lack of excitement among Droids, Heritrix, Common Crawl, Nutch, etc. has made it pretty clear that getting widespread adoption would be an uphill battle, one that I don't currently have the time to fight.

          – Ken

          lewismc Lewis John McGibbney added a comment -

          Hi Ken,

           > I could start a thread, but I also don't want to flog a dead horse

           I thought there had been renewed interest over @ CC, but it looks like this is not the case. So I guess that we can progress with moving the sitemap-parser into Nutch. There have been people from the community who would like it, so I see no reason not to. There was also mention of the canonical tag topic again in the thread I cited above (and there are also issues already logged on our Jira for this), so it will be interesting to see what the code contains.

          kkrugler Ken Krugler added a comment -

          Hi Lewis,

          Just to be clear, I think the dead horse is trying to get people interested in porting their code to crawler-commons, and then switching existing functionality to rely on cc.

          For anything new (like sitemap parsing) I think it's a no-brainer to use cc, unless the API is totally borked. E.g. if you didn't, then you wouldn't have picked up our BOM fix.

          – Ken

          lewismc Lewis John McGibbney added a comment -

          So CC it is for sitemap parsing support in Nutch :0)

          tejasp Tejas Patil added a comment -

           This is a work in progress. So far I have done the following:

           • Added a new status named STATUS_SITEMAP to CrawlDatum. I plan to use it to identify the sitemap urls in the update phase.
           • Modified the robots parsing code to extract the links to sitemap pages.
           • Added a new class SitemapProcessor which will cache the links to sitemap pages, use the sitemap parser in CC and ensure that for a given host, sitemaps are processed just once (see the sketch at the end of this comment).

          Attached a patch (NUTCH-1465-trunk.v1.patch) for the changes.
          Things pending:

           • write the sitemap urls (from the Fetcher class) to the segments in the form of CrawlDatum entries
          • modify the update phase to take care of STATUS_SITEMAP and update the crawl frequency.

          If anyone has any suggestions in terms of design and approach, please let me know.
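
           For reference, a minimal sketch of the crawler-commons call that such a SitemapProcessor would wrap; the method names reflect the CC sitemap API of that time and may differ between versions, so treat this as an assumption rather than the patch itself:

             import java.net.URL;
             import java.util.Collection;

             import crawlercommons.sitemaps.AbstractSiteMap;
             import crawlercommons.sitemaps.SiteMap;
             import crawlercommons.sitemaps.SiteMapParser;
             import crawlercommons.sitemaps.SiteMapURL;

             public class SitemapParseExample {
               public static void printUrls(byte[] content, String contentType, String location)
                   throws Exception {
                 SiteMapParser parser = new SiteMapParser();
                 AbstractSiteMap parsed = parser.parseSiteMap(contentType, content, new URL(location));
                 if (!parsed.isIndex()) {
                   // each entry carries the optional lastmod/changefreq/priority metadata
                   Collection<SiteMapURL> urls = ((SiteMap) parsed).getSiteMapUrls();
                   for (SiteMapURL u : urls) {
                     System.out.println(u.getUrl() + " lastmod=" + u.getLastModified()
                         + " priority=" + u.getPriority());
                   }
                 }
               }
             }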

          kkrugler Ken Krugler added a comment -

          Hi Tejas - I thought the current CC robots parsing code was already extracting the sitemap links. Or is the above comment ("modified the robots parsing code to extract the links to sitemap pages") a change to the current Nutch robots parsing code?

          I do remember thinking that the CC version would need to change to support multiple Sitemap links, even though it wasn't clear whether that was actually valid.

          – Ken

          tejasp Tejas Patil added a comment -

          Hi Ken,
           As the CC robots integration jira is not closed, I made this change on the current trunk.

           I did not understand this ("CC version would need to change to support multiple Sitemap links"). Do you mean that CC isn't allowing multiple sitemap links in a robots file (like this) or in a sitemap index file?

          kkrugler Ken Krugler added a comment -

          Hi Tejas - the original code didn't, but I checked and now remember that I added support for multiple sitemap URLs to BaseRobotRules in CC.
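
           (Assuming the crawler-commons robots API of that time, the discovered sitemap links are exposed roughly as below; the wrapper class is purely illustrative:)

             import java.util.List;

             import crawlercommons.robots.BaseRobotRules;

             public class RobotsSitemapsExample {
               // one entry per "Sitemap:" line found while parsing robots.txt
               public static List<String> sitemapsOf(BaseRobotRules rules) {
                 return rules.getSitemaps();
               }
             }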

          wastl-nagel Sebastian Nagel added a comment -

          Hi Tejas,
          thanks and a few comments on the patch:

           “For a given host, sitemaps are processed just once.” But they are not cached over cycles because the cache is bound to the protocol object. Is this correct? So a sitemap is fetched and processed every cycle for every host? If yes, and sitemaps are large (they can be!), this would cause a lot of extra traffic.

           Shouldn't sitemap URLs be handled the same way as any other URL: add them to the CrawlDb, fetch and parse once, add found links to the CrawlDb; cf. Ken's post at CC. There are some complications:

          • due to their size, sitemaps may require larger values regarding size and time limits
          • sitemaps may require more frequent re-fetching (eg. by MimeAdaptiveFetchSchedule)
          • the current Outlink class cannot hold extra information contained in sitemaps (lastmod, changefreq, etc.)

           There is another way, which we use for several customers: a SitemapInjector fetches the sitemaps, extracts the URLs and injects them with all extra information. It's a simple use case for a customized site search: there is a sitemap and it shall be used as the seed list or even the exclusive list of documents to be crawled. Is there any interest in this solution? It's not a general solution and is not adaptable to a large web crawl.

          tejasp Tejas Patil added a comment - - edited

          Hi Sebastian,

           By “for a given host, sitemaps are processed just once” I meant: in the same round, the processing is done just once for a given host. I agree with you that a sitemap is fetched and processed every cycle for every host. The SitemapInjector idea is good.

           The way I see this, "SitemapInjector" will be:

          • Separate map-reduce job
          • Responsible for fetching sitemap location(s) from robots file, getting the sitemap file(s) and adding the urls (along with the crawl freq. etc meta) from sitemap to the crawldb.
           • For large web crawls, we don't want to run this job in every nutch cycle. Also, new hosts will be discovered along the way, for which the sitemaps need to be added to the crawldb. For hosts whose sitemaps were already processed, a new sitemap location might have been added to the robots file. So have a "sitemapFrequency" param in the crawl script, e.g. if sitemapFrequency=10, the sitemap job will be invoked every 10 nutch crawl cycles (1st cycle, 11th cycle, 21st cycle and so on).
          • Users can also run this job in standalone fashion on a crawldb.

          What say ?
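
           (A trivial sketch of that gating check; the names are illustrative only, since the actual crawl script is shell-based:)

             public class SitemapSchedule {
               /** With sitemapFrequency=10 this fires on cycles 1, 11, 21, ... */
               public static boolean runSitemapJob(int cycle, int sitemapFrequency) {
                 return sitemapFrequency > 0 && (cycle - 1) % sitemapFrequency == 0;
               }
             }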

          wastl-nagel Sebastian Nagel added a comment -

          Yes, SitemapInjector is a map-reduce job. The scenario for its use is the following:

          • a small set of sites to be crawled (eg, to feed a site-search index)
          • you can think of sitemaps as "remote seed lists". Because many content management systems can generate sitemaps it is convenient for the site owners to publish seeds. The URLs contained in the sitemap can be also the complete and exclusive set of URLs to be crawled (you can use the plugin scoring-depth to limit the crawl to seed URLs).
           • because you can trust the sitemap's content
             • checks for "cross submissions" are not necessary
             • extra information (lastmod, changefreq, priority) can be used
           That's why we use sitemaps: remote seed lists, maintained by customers, quite convenient if you run a crawler as a service.

          For large web crawls there is also another aspect: detection of sitemaps which is bound to processing of robots.txt. Processing of sitemaps can (and should?) be done the usual Nutch way:

          • detection is done in the protocol plugin (see Tejas' patch)
          • record in CrawlDb: done by Fetcher (cross submission information can be added)
          • fetch (if not yet done), parse (a plugin parse-sitemap based on crawler-commons?) and extract outlinks: sitemaps may require special treatment here because they can be large in size and usually contain many outlinks. Also the Outlink class needs to be extended to deal with the extra info relevant for scheduling
           Using an extra tool (such as the SitemapInjector) for processing the sitemaps has the disadvantage that we first must get all sitemap URLs out of the CrawlDb. On the other hand, special treatment can easily be realized in a separate map-reduce job.

          Comments?!

          Thanks, Tejas: the feature is moving forward thanks to your initiative!

          markus17 Markus Jelsma added a comment -

          Thanks all for your interesting comments.

           It's a complicated issue. On one hand, host data should be stored in NUTCH-1325, but that would require additional logic and sending each segment's output to the hostdb in case a sitemap was crawled. On the other hand, it's ideal to store host data. It's also easy to use in jobs such as the indexer and generator.

          I don't yet favour a specific approach but storing sitemap data in a hostdb may be something to think about.

          Cheers

          tejasp Tejas Patil added a comment -

          Hi Sebastian,

          So we are looking at 2 things here:

          • a standalone utility for injecting sitemaps to crawldb:
            1. User starts off with urls to sitemap pages
            2. SitemapInjector fetches these seeds, parses it (with a parse plugin based on CC)
            3. SitemapInjector updates the crawldb with the sitemap entries.
          • handling of sitemap within the nutch cycle: fetch, parse and update phases
            1. Robots parsing will populate a table of "host": <list of links to sitemap pages>
            2. These will be added to the fetcher queue and will be fetched
            3. A parser plugin based on CC will parse the sitemap page
            4. Outlink class needs to be extended to store the meta obtained from sitemap
            5. Write this into the segment
             6. Update phase needs to update the crawl frequency of already existing urls in the crawldb based on what we got from the sitemap; otherwise just add new entries to the crawldb.

           I am not clear about the extending-Outlink part. The normal outlink extraction need not be done as CC will already do that for us. The sitemap parser plugin must do this and create objects of our specialized sitemap link. While writing, where is the CrawlDatum generated from the outlink?

           The mime type that we get is "text/xml", which can also mean a normal xml file. How will nutch identify that it is a sitemap page and invoke the correct parser plugin? (I know that this magic is done by the feed parser, but I am not sure which part of the code is doing that. Just point me to that code.)
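
           (Not an answer as to how the feed parser does it, but one possible way to recognize a sitemap regardless of the generic "text/xml" mime type is to peek at the root element, which the sitemap protocol fixes as "urlset" or "sitemapindex". A self-contained sketch; gzipped sitemaps would need decompression first:)

             import java.io.ByteArrayInputStream;

             import javax.xml.stream.XMLInputFactory;
             import javax.xml.stream.XMLStreamConstants;
             import javax.xml.stream.XMLStreamReader;

             public class SitemapSniffer {
               public static boolean looksLikeSitemap(byte[] content) {
                 try {
                   XMLStreamReader reader = XMLInputFactory.newInstance()
                       .createXMLStreamReader(new ByteArrayInputStream(content));
                   while (reader.hasNext()) {
                     if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                       // the first start element is the document root
                       String root = reader.getLocalName();
                       return "urlset".equals(root) || "sitemapindex".equals(root);
                     }
                   }
                 } catch (Exception e) {
                   // not well-formed XML, so certainly not a sitemap
                 }
                 return false;
               }
             }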

          brian44 Brian added a comment -

          Is a separate issue needed for support in 2.X?

          tejasp Tejas Patil added a comment -

           Revisited this Jira after a long time and gave some thought to how this can be done cleanly. Two ways of implementing this:

          (A) Do the sitemap stuff in the fetch phase of nutch cycle.
          This was my original approach which the (in-progress) patch addresses. This would involve tweaking core nutch classes at several locations.

          Pros:

          • Sitemaps are nothing but normal pages with several outlinks. Fits well in the 'fetch' cycle.

          Cons:

           • Sitemaps can be huge. Fetching them needs large size and time limits. The fetch code must have a special case to detect that the url is a sitemap url and use custom limits, which leads to a hacky coding style.
           • The Outlink class cannot hold the extra information contained in sitemaps (like lastmod and changefreq). We could modify it to hold this information too, but it is specific to sitemaps and yet we would end up making all outlinks hold this info. Alternatively, we could create a special type of outlink.

          (B) Have separate job for the sitemap stuff and merge its output into the crawldb.
           i. User populates a list of hosts (or uses the HostDB from NUTCH-1325). Now we have all the hosts to be processed.
          ii. Run a map-reduce job: for each host,

          • get the robots page, extract sitemap urls,
          • get xml content of these sitemap pages
           • create crawl datums with the required info and write them to a sitemapDB

          iii. Use CrawlDbMerger utility to merge the sitemapDB and crawldb

          Pros:

          • Cleaner code.
           • Users have control over when to perform sitemap extraction. This is better than (A), wherein sitemap urls sit in the crawldb and get fetched along with normal pages (thus eating up fetch time in every fetch phase). We can have a sitemap_frequency used inside the crawl script so that users can say: run sitemap processing after every 'x' nutch cycles.

          Cons:

           • Additional map-reduce jobs are needed. I think that this is reasonable: running the sitemap job 1-5 times a month on a production-level crawl would work out well.

           I am inclined towards implementing (B).
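
           (A sketch of step ii above: turning one crawler-commons SiteMapURL entry into a CrawlDatum for the temporary sitemapDB. The status and default interval used here are assumptions for illustration, not necessarily what the final patch does:)

             import org.apache.nutch.crawl.CrawlDatum;

             import crawlercommons.sitemaps.SiteMapURL;

             public class SitemapDatumExample {
               public static CrawlDatum toDatum(SiteMapURL entry, int defaultIntervalSecs) {
                 CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, defaultIntervalSecs);
                 if (entry.getLastModified() != null) {
                   datum.setModifiedTime(entry.getLastModified().getTime());
                 }
                 // sitemap <priority>; crawler-commons defaults this to 0.5
                 datum.setScore((float) entry.getPriority());
                 return datum;
               }
             }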

          lewismc Lewis John McGibbney added a comment -

          Hi Tejas Patil... nice logic.
          Some notes here from my observations of the crawler commons code (and possibly sitemap standards as well)

              /** According to the specs, 50K URLs per Sitemap is the max */
              private static final int MAX_URLS = 50000;
          
              /** Sitemap docs must be limited to 10MB (10,485,760 bytes) */
              public static int MAX_BYTES_ALLOWED = 10485760;
          

           I would be inclined to agree with you on your preference to introduce the new MR SiteMapMRJob as in B above. It generally sounds much cleaner, with the changes being less scattered and hence affecting fewer areas of the existing codebase.
           Also, given that the HostDB has been coming along nicely in 1.X, I think this would be an excellent use of the CC SiteMap code.

          wastl-nagel Sebastian Nagel added a comment -

          Hi Tejas,
           attached you'll find a patch for a sitemap injector. Originally written by Hannes Schwarz, it has been used by us for some time. The patch contains a revised and improved version which, however, needs some more work (see TODOs in the code).
           The use case is somewhat different from way B: the sitemap injector takes URLs of sitemaps (not via robots.txt) and injects them directly into the CrawlDb (no extra sitemapDB - do we really need an extra DB?). Robots.txt is not used as an intermediate step/hop because experience has shown that customers often prepare a special sitemap for the site-search crawler which differs from the sitemap propagated in robots.txt.
          Btw., NUTCH-1622 would enable solution A: outlinks now can hold extra info.

          tejasp Tejas Patil added a comment - - edited

          Hi Sebastian Nagel,

           Nice share. The only grudge I have with that approach is that users will have to pick up sitemap urls for hosts manually and feed them to the sitemap injector. It would fit well where users are performing targeted crawling.
          For a large scale, open web crawl use case:
           i) the number of initial hosts can be large: a one-time burden for users
           ii) the crawler discovers new hosts over time: a constant pain for users to look out for the newly discovered hosts and then get sitemaps from robots.txt manually. With the HostDB from NUTCH-1325 and B, users won't suffer here.

          > do we really need an extra DB?
           I should have been clearer in the explanation. "sitemapDB" is a temporary location where all crawl datums of sitemap entries would be written. It can be deleted after merging with the main crawlDB. Quite analogous to what the inject operation does.

          > NUTCH-1622 would enable solution A: outlinks now can hold extra info.
           I didn't know that. Still, I would go in favor of B as it is clean, and A would involve messing around with the existing codebase in several places.

          wastl-nagel Sebastian Nagel added a comment -

          Let's add use case C:
          (C) inject URLs from given sitemap(s)
          i. user configures list of known and trusted sitemaps
          ii. URLs are extracted from sitemaps and injected into CrawlDb
          Use case: small/medium size customized crawls

           Is C a common use case, worth being integrated?

          tejasp Tejas Patil added a comment -

          Hi Sebastian Nagel,
           Yes, I think that it should be there too. I will be working on the patch this weekend and will post an update. Thanks for your inputs and suggestions so far; they were super helpful in chalking out the right specs for this feature.

          tejasp Tejas Patil added a comment -

          Attaching NUTCH-1465-trunk.v2.patch which has implementation of option (B) Have separate job for the sitemap stuff and merge its output into the crawldb

           I have tied both use cases together in this patch:
          1. users with targeted crawl who want to get sitemaps injected from a list of sitemap urls - the use case which Sebastian Nagel had pointed out.
          2. large open web crawls where users cannot afford to generate sitemap seeds for all the hosts and want nutch to inject sitemaps automatically.

          To try out this patch:
          1. Apply the patch for HostDb feature (https://issues.apache.org/jira/secure/attachment/12624178/NUTCH-1325-trunk-v4.patch)
          2. Apply this patch (NUTCH-1465-trunk.v2.patch)
          3. (optional) Add this to conf/log4j.properties at line 11:

          log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout
          

           4. Run using

          bin/nutch org.apache.nutch.util.SitemapProcessor
          

          I have started working on a wiki page describing this feature: https://wiki.apache.org/nutch/SitemapFeature

          Any suggestion and comments are welcome.

          tejasp Tejas Patil added a comment -

           Now that HostDb (NUTCH-1325) is in trunk, I updated the patch (v3).
          Also,

          • included job counters
          • more documentation
          • added sitemap references in log4j.properties and bin/nutch script.

          For usage, see https://wiki.apache.org/nutch/SitemapFeature

          lewismc Lewis John McGibbney added a comment - - edited

          Hey Tejas Patil. Again, great work! Some minor comments

           • Class-level Javadoc in SitemapProcessor would be more legible if it used a format similar to
            SitemapProcessor.java
            /**
             * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and merging
             * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
             *
             * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
             * <ol>
             *  <li>Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
             *     list of sitemap links and get only those sitemap pages. This suits well for targeted
             *     crawl of specific hosts.</li>
             *  <li>For open web crawl, it is not possible to track each host and get the sitemap links
             *     manually. Nutch would automatically get the sitemaps for all the hosts seen in the
             *     crawls and inject the urls from sitemap to the crawldb.</li>
             * </ol>
             * <p>For more details see:
              *      https://wiki.apache.org/nutch/SitemapFeature </p>
             */
            
          • I think that the following logging line should be changed to WARN or ERROR
            SitemapProcessor.java
            } catch (Exception e) {
            +          LOG.info("Exception for url " + key.toString() + " : " + StringUtils.stringifyException(e)); 
            
          • This is merely a suggestion, but in SitemapProcessor#filterNormalize(String u), could we not use one of the methods from URLUtil.java instead?
            SitemapProcessor.java
                  if(!u.startsWith("http://") && !u.startsWith("https://")) {
                    // We received a hostname here so let's make a URL
                    url = "http://" + u + "/";
                    isHost = true;
                  }
            

           That's about it from me, mate. This looks like an excellent addition to Nutch again. I made a trivial update to the wiki page to drop in some links and background to your work on this one.

           I should probably add that on local tests this works fine for me, e.g. injecting from a sitemap file and from the HostDb.

          tejasp Tejas Patil added a comment -

          Hi Lewis John McGibbney,
           +1 for the first two suggestions. For #3: I skimmed through the methods inside URLUtil.java and nothing came to my notice that I could use in the sitemap code you pointed to. Can you please confirm?

          A big thanks mate for trying out the feature. Hopefully we get this into 1.8 release.
          Cheers !!

          lewismc Lewis John McGibbney added a comment -

           hey Tejas Patil no probs. RE: #3, I was just curious to see if we could reuse some of the methods we have in URLUtil. Now that I've looked, I feel you're right.
           This patch reminds me of pushing filtering and normalization out to crawler commons anyway, but that is another can of worms.
           I'll let others comment here. Right now I am +1 on this patch.

          tejasp Tejas Patil added a comment -

          Attaching v4 patch with the suggestions #1 and #2 from Lewis John McGibbney.

          wastl-nagel Sebastian Nagel added a comment -

           Great, looks good and is really compact while providing a lot of functionality. I've just started to test SitemapProcessor; here are my first comments:

          • SitemapProcessor.java has no Apache license header
          • would be nice to see counters in log output
           • regarding Lewis' point #3: doesn't a comment like "a hacky way" mean "try to avoid that"? Why not set isHost inside map(...) via isHost = (value instanceof HostDatum) and pass it as a parameter to filterNormalize()? This would avoid errors due to incomplete heuristics, for example when testing with sitemaps accessed via the file protocol:
            INFO  api.HttpRobotRulesParser - Couldn't get robots.txt for http://file:/tmp/sitemap1.xml/: java.net.UnknownHostException: file
            
           • concurrency: "returning" the value of isHost from filterNormalize() to map() via a member variable is not thread-safe and will cause problems in combination with MultithreadedMapper. One more argument for passing it from map() to filterNormalize() as a parameter.
          tejasp Tejas Patil added a comment -

          Hi Sebastian Nagel,
           Thanks a lot for your comments. The first two were straightforward and I agree with them.

           Re "hacky way": for hosts from the HostDb, we don't know which protocol they belong to. In the code I was checking if http:// is a match and, if that was a bad guess, then trying with https://. I didn't handle the ftp:// and file:/ schemes. By "hacky" I meant this approach of trial and error until a suitable match is found and we create a homepage url for the host. I have thought about your comment and will have a better (yet still hacky) way in the coming patch.

          Re "concurrency": I had thought of this and had searched over internet for internals of MultithreadedMapper. All I could get is that it has an internal thread pool and each input record to handed over to a thread in this pool to run map() over it. I wrote this code to check if thread safety was ensured in MultithreadedMapper:

            private static class SitemapMapper extends Mapper<Text, Writable, Text, CrawlDatum> {
              private String myurl = null;
          
              public void map(Text key, Writable value, Context context) throws IOException, InterruptedException {
                if (value instanceof Text) {
                  String url = key.toString();
                  if(foo(url).compareTo(url) != 0) {
                    LOG.warn("Race condition found !!!");
                  }
                }
              }
          
              private String foo(String url) {
                myurl = url;
                if(Thread.currentThread().getId() % 2 == 1) {
                  try {
                    Thread.sleep(10000);
                  } catch(InterruptedException e) {
                    LOG.warn(e.getMessage());
                  }
                }
                return myurl;
              }
             }

           I ran it multiple times with threads set to 10, 100, 1000 and 2000 but never hit the race condition. Is the code snippet above a good way to reveal a race condition? It won't be a formal conclusion, more of an experimental one. How do I get a concrete conclusion on whether MultithreadedMapper is thread-safe or not?

          wastl-nagel Sebastian Nagel added a comment -

          Sorry, you're right: the comment "hacky way" applies to trying http and https to check which host-URL would pass the filters. That's ok, there is no better solution for that.
          But what about the decision whether a string passed to filterNormalize() is a host from HostDb or a URL from a list of sitemaps? This decision could be made without any heuristics: inside map() we know the type (host or sitemap Url) from the class of the value:

          boolean isHost = (value instanceof HostDatum);
          String url = filterNormalize(key.toString(), isHost);
          

           The method filterNormalize() could then be simplified and the member variable isHost would be obsolete.
           Regarding concurrency: the javadoc of MultithreadedMapper states that "Mapper implementations using this MapRunnable must be thread-safe." If in doubt, it may be better to follow this advice and not rely on the (current) implementation. If SiteMapParser is thread-safe (at first glance, it is), it should be easy to make SitemapMapper thread-safe.

          tejasp Tejas Patil added a comment -

          Adding new patch 'v5' with below changes:
          1. Added Apache license header as per review comment by Sebastian Nagel
          2. Added counters in log output as per review comment by Sebastian Nagel
           3. Implemented the change suggested by Sebastian Nagel for 'isHost' and 'filterNormalize'. I could do more refactoring and make it cleaner.
          4. Added a new parameter "-noStrict" to control the checking done by sitemap parser

          wastl-nagel Sebastian Nagel added a comment -

          Thanks, Tejas Patil for the improvements! Testings continued...

           Sitemaps are treated the same as ordinary URLs/docs, but there are some differences. Shouldn't we relax the default limits and filters and trust the restrictions specified in the sitemap protocol?

          • URL filters and normalizers: maybe you want to exclude .gz docs per suffix filter but still fetch gzipped sitemaps. That's not possible. Is it really necessary to normalize/filter sitemap URLs? If yes, this should be optional.
           • the default content limits ({http,ftp,file}.content.limit, 64 kB) are quite small even for mid-size sitemaps. Ok, you could set it per -D... but why not increase it to SiteMapParser.MAX_BYTES_ALLOWED?

           • maybe we also want to increase the fetch timeout

           Processing sitemap indexes fails:

          • the check sitemap.isIndex() skips all referenced sitemaps
          • protocol for sitemap index and referenced sub-sitemaps may be different (eg., one sub-sitemap could be https while others are http)
          • if processing one of the referenced sitemaps fails, the remaining sub-sitemaps are not processed

           Fetch intervals are taken unchecked from <changefreq>. Should we limit them to reasonable values (db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)? Fetch intervals of 1 second or 1 hour may cause trouble. [1] explicitly says that <changefreq> "is considered a hint and not a command".
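
           (A minimal sketch of such a clamp; the property names are taken from the comment above, while the helper itself and the fallback values are assumptions for illustration only:)

             import org.apache.hadoop.conf.Configuration;

             public class SitemapIntervalClamp {
               public static int clamp(Configuration conf, int sitemapIntervalSecs) {
                 // fallbacks are illustrative, not necessarily the Nutch defaults
                 int min = conf.getInt("db.fetch.schedule.adaptive.min_interval", 60);
                 int max = conf.getInt("db.fetch.interval.max", 90 * 24 * 3600);
                 return Math.max(min, Math.min(sitemapIntervalSecs, max));
               }
             }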

          wastl-nagel Sebastian Nagel added a comment -

          SitemapReducer overwrites the score, modified time, and fetch interval of existing CrawlDb entries with the values from the sitemap. Is this the desired behavior? What about a forgotten, hopelessly outdated sitemap? Or bogus values (last modification in the future)?
          If a sitemap does not specify one of score, modified time, or fetch interval, this value is set to zero. In this case, we should definitely not overwrite existing values. Newly added entries should get assigned db.fetch.interval.default and a reasonable score, e.g. 0.5 as recommended by [2]. But that may depend on scoring plugins. Comments?

          tejasp Tejas Patil added a comment -

          Interesting comments, Sebastian Nagel.

          Re "filters and normalizers" : By default I have kept those ON but can be disabled by using "-noFilter" and "-noNormalize".
          Re "default content limits" and "fetch timeout": +1. Agree with you.
          Re "Processing sitemap indexes fails" : +1. Nice catch.
          Re "Fetch intervals of 1 second or 1 hour may cause troubles" : Currently, Injector allows users to provide a custom fetch interval with any value eg. 1 sec. It makes sense not the correct it as user wants Nutch use that custom fetch interval. If we view sitemaps as custom seed list given by a content owner, then it would make sense to follow the intervals. But as you said that sitemaps can be wrongly set or outdated, the intervals might be incorrect. The question bolis down to: We are blindly accepting user's custom information in inject. Should we blindly assume that sitemaps are correct or not ? I have no strong opinion about either side of the argument.

          (PS : The default 'db.fetch.schedule.adaptive.min_interval' is 1 min, so 1 hr would be allowed as per db.fetch.schedule.adaptive.min_interval <= interval)

          Re "SitemapReducer overwriting" :
          >> "If a sitemap does not specify one of score, modified time, or fetch interval this values is set to zero. "
          Nope. See SiteMapURL.java

          (a) score : Crawler commons assigns a default score of 0.5 if there was none provided in the sitemap.
          We can do this: if an old entry has a score other than 0.5, preserve it, else update it. For a new entry, use scoring plugins when the score equals 0.5, else keep the sitemap value.
          Limitation: It's not possible to distinguish whether the score of 0.5 comes from the sitemap or is the default assigned when <priority> was absent.
          (b) fetch interval : Crawler commons does NOT set the fetch interval if there was none provided in the sitemap. So we are sure that whatever value is used is coming from <changefreq>. Validation might be needed as per the comments above.
          (c) modified time : Same as fetch interval; unless parsed from the sitemap file, the modified time is set to NULL. The only possible validation is to drop values greater than the current time.

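          For reference, a small sketch of how these three values surface from crawler-commons' SiteMapURL; the getter names match current crawler-commons but should be treated as an assumption if another version is used:

          import java.util.Date;
          import crawlercommons.sitemaps.SiteMapURL;

          public class SitemapValueCheck {
            /** True if <lastmod> was present in the sitemap and is not in the future. */
            public static boolean hasUsableLastMod(SiteMapURL u) {
              Date lastMod = u.getLastModified();           // null unless <lastmod> was given
              return lastMod != null && lastMod.getTime() <= System.currentTimeMillis();
            }

            public static void inspect(SiteMapURL u) {
              System.out.println(u.getUrl()
                  + " priority=" + u.getPriority()          // 0.5 by default if <priority> is absent
                  + " changefreq=" + u.getChangeFrequency() // null if <changefreq> is absent
                  + " usableLastMod=" + hasUsableLastMod(u));
            }
          }
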
          wastl-nagel Sebastian Nagel added a comment -

          "filters and normalizers": -noFilter is not really an option if sitemaps are used and gzipped documents (eg. software packages) shall be excluded. In customized crawls URL filter rules are often complex, and I want to avoid to have to sets of rules in the end. Sitemaps are different from normal docs/URLs (robots.txt is also different): they are not stored in CrawlDb and may require other filter rules. What about an option "-noFilterSitemap"?

          "Fetch intervals of 1 second or 1 hour may cause troubles":
          > We are blindly accepting the user's custom information in inject.
          Yes, because the user (crawl administrator) can change the seed list (it's a file/directory on local disk or HDFS). Sitemaps are not necessarily under the control of the user. If we (optionally) adjust the fetch interval by (configurable) min/max limits, that would help to avoid unreasonable values and, e.g., re-fetching a bunch of pages every cycle.

          "SitemapReducer overwriting" :
          In a continuous crawl we know when pages are modified and have heuristics to estimate the change frequency of a page (AdaptiveFetchSchedule). The question is whether we trust those values obtained from crawling or prefer the (possibly bogus) values from sitemaps. Using the sitemap values for new URLs found in sitemaps is less critical.

          > (a) score : Crawler commons assigns a default score of 0.5 if there was none provided in the sitemap.
          Needs an upgrade of crawler-commons (0.2 is still used, which sets the priority to 0).

          tejasp Tejas Patil added a comment -

          Re "filters and normalizers": +1.

          Re "fetch intervals" and "reducer overwriting": I have never encountered bogus sitemaps but that was for a intranet crawl and it would be better to take care of that in this jira. Here is what I conclude from the discussion till now:
          (1) fetch interval: For old entries, don't use the value from sitemap. For new ones, use the value from sitemap provided (db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)
          (2) score: Never use value from sitemap. For new ones, use scoring filters. Keep the value of old entries as it is.
          (3) modified time: Always use the value from sitemap provided its not a date in future.

          Did I get it right?

          Re "score": I missed that the jar is old. Would file a jira to upgrade CC to v0.3 in Nutch.

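          A rough sketch of rules (1)-(3) as they might look when merging entries; the names (old, fromSitemap) are placeholders and this is not the actual SitemapReducer code:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.nutch.crawl.CrawlDatum;

          public class SitemapMergeRules {
            /** Merges a sitemap-derived datum into an (optional) existing CrawlDb entry. */
            public static CrawlDatum merge(Configuration conf, CrawlDatum old, CrawlDatum fromSitemap) {
              int minInterval = conf.getInt("db.fetch.schedule.adaptive.min_interval", 60);
              int maxInterval = conf.getInt("db.fetch.interval.max", 7776000);

              if (old != null) {
                // (1) + (2): keep the fetch interval and score of the existing entry untouched
                long lastMod = fromSitemap.getModifiedTime();
                if (lastMod > 0 && lastMod <= System.currentTimeMillis())
                  old.setModifiedTime(lastMod);               // (3) accept lastmod unless it is in the future
                return old;
              }

              // new URL discovered via the sitemap
              int interval = fromSitemap.getFetchInterval();
              if (interval < minInterval || interval > maxInterval)
                fromSitemap.setFetchInterval(conf.getInt("db.fetch.interval.default", 2592000)); // (1)
              fromSitemap.setScore(0.0f);                     // (2) leave scoring to ScoringFilter.initialScore(...)
              if (fromSitemap.getModifiedTime() > System.currentTimeMillis())
                fromSitemap.setModifiedTime(0L);              // (3) drop dates in the future
              return fromSitemap;
            }
          }
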
          wastl-nagel Sebastian Nagel added a comment -

          (1) fetch interval: ...
          +1, sounds plausible.

          (2) score: Never use the value from the sitemap. For new ones, use scoring filters. Keep the value of old entries as it is.
          That means use ScoringFilter.initialScore(...) for new ones?
          Why not use the priority for newly found URLs? If the site owner takes it seriously, the score can be useful. We could make it configurable, e.g. by a factor sitemap.priority.factor. If it's 0.0, the priority is not used. Usually, the factor should be low to avoid that the total score in the web graph (cf. FixingOpicScoring) gets too high when "injecting" 50,000 URLs from sitemaps, each with priority 1.0. Alternatively, we could just put the values from the sitemap in CrawlDatum's metadata and "delegate" any actions to set the score to scoring filters or FetchSchedule implementations. Users can then more easily adapt any sitemap logic to their needs (cf. below).

          (3) modified time: Always use the value from the sitemap provided it's not a date in the future.
          Um, this way seems conceptually wrong (and was also the case in SitemapInjector).
          The modified time in CrawlDb must indicate the time of the last fetch or the modified time sent by the server when a page was fetched. If we overwrite the modified time, the server may just answer not-modified on an if-modified-since request and we'll never get the current version of a page. So we must not touch the modified time, even for newly discovered pages, where it must be 0. If it's not zero, an if-modified-since header field is sent although the page has never been fetched, cf. HttpResponse.java.
          If we can trust the sitemap, the desired behaviour would be to set the fetch time (in CrawlDb = time when the next fetch should happen) to now (or the sitemap modified time) if (and only if) sitemap.modif > crawldb.modif. This would make sure that changed pages are fetched asap. If the sitemap is not 100% trustworthy, we should be more careful.
          Could we again delegate this decision (trustworthy or not) to scoring filter or FetchSchedule implementations? Whether we can trust a sitemap may depend on the concrete crawler config/project and should be configurable. Would this require a new method in the scoring/schedule interfaces?

          More open questions than before!? Comments are welcome!

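          To make the idea concrete, a minimal sketch of the suggested sitemap.priority.factor; the property name and its semantics are only a proposal here, not an existing Nutch option:

          import org.apache.hadoop.conf.Configuration;
          import crawlercommons.sitemaps.SiteMapURL;

          public class SitemapPriorityScore {
            /**
             * Score for a newly discovered sitemap URL, or -1 if the priority should be
             * ignored (factor == 0.0) and ScoringFilter.initialScore(...) used instead.
             */
            public static float priorityScore(Configuration conf, SiteMapURL u) {
              float factor = conf.getFloat("sitemap.priority.factor", 0.0f); // proposed property, off by default
              if (factor == 0.0f)
                return -1.0f;
              return (float) (factor * u.getPriority()); // <priority> is in the range [0.0, 1.0]
            }
          }
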
          lewismc Lewis John McGibbney added a comment -

          I'm going to take this on. We want full sitemap support in our current crawlers, so I am making this my priority. I'll submit a pull request for the current patches, then we can take it from there.

          wastl-nagel Sebastian Nagel added a comment - edited

          Hi Lewis, a couple of months ago I applied the latest patch here (NUTCH-1465-trunk.v5.patch) to master, see https://github.com/sebastian-nagel/nutch/tree/NUTCH-1465. But I had to port this to the Common Crawl fork of Nutch (https://github.com/commoncrawl/nutch), so I chose the SitemapInjector from an older patch which was still based on the old mapred API.

          githubbot ASF GitHub Bot added a comment -

          lewismc opened a new pull request #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189

          Hi folks, this issue addresses NUTCH-1465 (https://issues.apache.org/jira/browse/NUTCH-1465). I have an issue with some code which I will point out separately.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189#discussion_r113578491

          ##########
          File path: src/java/org/apache/nutch/util/SitemapProcessor.java
          ##########
          @@ -0,0 +1,436 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.nutch.util;
          +
          +import java.io.IOException;
          +import java.net.URL;
          +import java.text.SimpleDateFormat;
          +import java.util.Collection;
          +import java.util.LinkedList;
          +import java.util.List;
          +import java.util.Random;
          +
          +import org.apache.hadoop.conf.Configuration;
          +import org.apache.hadoop.conf.Configured;
          +import org.apache.hadoop.fs.FileSystem;
          +import org.apache.hadoop.fs.Path;
          +import org.apache.hadoop.io.Text;
          +import org.apache.hadoop.io.Writable;
          +import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.Mapper;
          +import org.apache.hadoop.mapreduce.Reducer;
          +import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
          +import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
          +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
          +import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
          +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
          +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
          +import org.apache.hadoop.util.StringUtils;
          +import org.apache.hadoop.util.Tool;
          +import org.apache.hadoop.util.ToolRunner;
          +
          +import org.apache.nutch.crawl.CrawlDatum;
          +import org.apache.nutch.hostdb.HostDatum;
          +import org.apache.nutch.net.URLFilters;
          +import org.apache.nutch.net.URLNormalizers;
          +import org.apache.nutch.protocol.Content;
          +import org.apache.nutch.protocol.Protocol;
          +import org.apache.nutch.protocol.ProtocolFactory;
          +import org.apache.nutch.protocol.ProtocolOutput;
          +import org.apache.nutch.protocol.ProtocolStatus;
          +
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +
          +import crawlercommons.robots.BaseRobotRules;
          +import crawlercommons.sitemaps.AbstractSiteMap;
          +import crawlercommons.sitemaps.SiteMap;
          +import crawlercommons.sitemaps.SiteMapIndex;
          +import crawlercommons.sitemaps.SiteMapParser;
          +import crawlercommons.sitemaps.SiteMapURL;
          +
          +/**
          + * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and merging
          + * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
          + *
          + * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
          + * <ol>
          + * <li>Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
          + * list of sitemap links and get only those sitemap pages. This suits well for targeted
          + * crawl of specific hosts.</li>
          + * <li>For open web crawl, it is not possible to track each host and get the sitemap links
          + * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
          + * crawls and inject the urls from sitemap to the crawldb.</li>
          + * </ol>
          + *
          + * <p>For more details see:
          + * https://wiki.apache.org/nutch/SitemapFeature </p>
          + */
          +public class SitemapProcessor extends Configured implements Tool {
          + public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
          + public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
          +
          + public static final String CURRENT_NAME = "current";
          + public static final String LOCK_NAME = ".locked";
          + public static final String SITEMAP_STRICT_PARSING = "sitemap.strict.parsing";
          + public static final String SITEMAP_URL_FILTERING = "sitemap.url.filter";
          + public static final String SITEMAP_URL_NORMALIZING = "sitemap.url.normalize";
          +
          + private static class SitemapMapper extends Mapper<Text, Writable, Text, CrawlDatum> {
          + private ProtocolFactory protocolFactory = null;
          + private boolean strict = true;
          + private boolean filter = true;
          + private boolean normalize = true;
          + private URLFilters filters = null;
          + private URLNormalizers normalizers = null;
          + private CrawlDatum datum = new CrawlDatum();
          + private SiteMapParser parser = null;
          +
          + public void setup(Context context) {
          + Configuration conf = context.getConfiguration();
          + this.protocolFactory = new ProtocolFactory(conf);
          + this.filter = conf.getBoolean(SITEMAP_URL_FILTERING, true);
          + this.normalize = conf.getBoolean(SITEMAP_URL_NORMALIZING, true);
          + this.strict = conf.getBoolean(SITEMAP_STRICT_PARSING, true);
          + this.parser = new SiteMapParser(strict);
          +
          + if (filter)
          + filters = new URLFilters(conf);
          + if (normalize)
          + normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_DEFAULT);
          + }
          +
          + public void map(Text key, Writable value, Context context) throws IOException, InterruptedException {
          + String url;
          +
          + try {
          + if (value instanceof CrawlDatum) {
          + // If its an entry from CrawlDb, emit it. It will be merged in the reducer
          + context.write(key, (CrawlDatum) value);
          + }
          + else if (value instanceof HostDatum) {
          + // For entry from hostdb, get sitemap url(s) from robots.txt, fetch the sitemap,
          + // extract urls and emit those
          +
          + // try different combinations of schemes one by one till we get rejection in all cases
          + String host = key.toString();
          + if((url = filterNormalize("http://" + host + "/")) == null &&
          + (url = filterNormalize("https://" + host + "/")) == null &&
          + (url = filterNormalize("ftp://" + host + "/")) == null &&
          + (url = filterNormalize("file:/" + host + "/")) == null)

          { + context.getCounter("Sitemap", "filtered_records").increment(1); + return; + }

          +
          + BaseRobotRules rules = protocolFactory.getProtocol(url).getRobotRules(new Text(url), datum, new LinkedList<>());

          Review comment:
          Always passing a new LinkedList as the third parameter to the [.getRobotRules](https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/protocol/Protocol.html#getRobotRules-org.apache.hadoop.io.Text-org.apache.nutch.crawl.CrawlDatum-java.util.List-) method call may not be preferable. I've looked at the code and we have the option to pass null. This needs to be tested.
          I have seen elsewhere in the codebase that use of this signature aligns with use of the fetcher.store.robotstxt configuration property... so we may wish to do the same here and align it.

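          A possible alignment, sketched as a fragment of the mapper above (it assumes the Configuration from setup() is kept in a field; fetcher.store.robotstxt is the property referred to):

          // Only collect robots.txt responses when they are to be stored, mirroring the
          // fetcher's use of fetcher.store.robotstxt; otherwise pass null instead of
          // allocating a fresh LinkedList per host.
          List<Content> robotsTxtContent = null;
          if (conf.getBoolean("fetcher.store.robotstxt", false)) {
            robotsTxtContent = new LinkedList<>();
          }
          BaseRobotRules rules = protocolFactory.getProtocol(url)
              .getRobotRules(new Text(url), datum, robotsTxtContent);
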
          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189#discussion_r113578673

          ##########
          File path: src/java/org/apache/nutch/util/SitemapProcessor.java
          ##########
          @@ -0,0 +1,436 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.nutch.util;
          +
          +import java.io.IOException;
          +import java.net.URL;
          +import java.text.SimpleDateFormat;
          +import java.util.Collection;
          +import java.util.LinkedList;
          +import java.util.List;
          +import java.util.Random;
          +
          +import org.apache.hadoop.conf.Configuration;
          +import org.apache.hadoop.conf.Configured;
          +import org.apache.hadoop.fs.FileSystem;
          +import org.apache.hadoop.fs.Path;
          +import org.apache.hadoop.io.Text;
          +import org.apache.hadoop.io.Writable;
          +import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.Mapper;
          +import org.apache.hadoop.mapreduce.Reducer;
          +import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
          +import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
          +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
          +import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
          +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
          +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
          +import org.apache.hadoop.util.StringUtils;
          +import org.apache.hadoop.util.Tool;
          +import org.apache.hadoop.util.ToolRunner;
          +
          +import org.apache.nutch.crawl.CrawlDatum;
          +import org.apache.nutch.hostdb.HostDatum;
          +import org.apache.nutch.net.URLFilters;
          +import org.apache.nutch.net.URLNormalizers;
          +import org.apache.nutch.protocol.Content;
          +import org.apache.nutch.protocol.Protocol;
          +import org.apache.nutch.protocol.ProtocolFactory;
          +import org.apache.nutch.protocol.ProtocolOutput;
          +import org.apache.nutch.protocol.ProtocolStatus;
          +
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +
          +import crawlercommons.robots.BaseRobotRules;
          +import crawlercommons.sitemaps.AbstractSiteMap;
          +import crawlercommons.sitemaps.SiteMap;
          +import crawlercommons.sitemaps.SiteMapIndex;
          +import crawlercommons.sitemaps.SiteMapParser;
          +import crawlercommons.sitemaps.SiteMapURL;
          +
          +/**
          + * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and merging
          + * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
          + *
          + * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
          + * <ol>
          + * <li>Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
          + * list of sitemap links and get only those sitemap pages. This suits well for targeted
          + * crawl of specific hosts.</li>
          + * <li>For open web crawl, it is not possible to track each host and get the sitemap links
          + * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
          + * crawls and inject the urls from sitemap to the crawldb.</li>
          + * </ol>
          + *
          + * <p>For more details see:
          + * https://wiki.apache.org/nutch/SitemapFeature </p>
          + */
          +public class SitemapProcessor extends Configured implements Tool {
          + public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
          + public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
          +
          + public static final String CURRENT_NAME = "current";

          Review comment:
          I also introduced this constant to mimic what is done in the CrawlDb and LinkDb classes. It is meant to represent the current HostDb... of course we don't have a HostDb class in the codebase right now, so this constant has been introduced.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189#issuecomment-297560764

          We could also improve with parameterized logging in due course. I wanted to post this patch as a mechanism for reigniting interest in sitemap parsing on the master branch.

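          For instance, parameterized SLF4J logging avoids string concatenation when the log level is disabled (the variable names here are illustrative):

          // Instead of: LOG.info("Filtered " + filtered + " sitemap records for " + host);
          LOG.info("Filtered {} sitemap records for {}", filtered, host);
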
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189#discussion_r113689082

          ##########
          File path: src/java/org/apache/nutch/util/SitemapProcessor.java
          ##########
          @@ -0,0 +1,436 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.nutch.util;
          +
          +import java.io.IOException;
          +import java.net.URL;
          +import java.text.SimpleDateFormat;
          +import java.util.Collection;
          +import java.util.LinkedList;
          +import java.util.List;
          +import java.util.Random;
          +
          +import org.apache.hadoop.conf.Configuration;
          +import org.apache.hadoop.conf.Configured;
          +import org.apache.hadoop.fs.FileSystem;
          +import org.apache.hadoop.fs.Path;
          +import org.apache.hadoop.io.Text;
          +import org.apache.hadoop.io.Writable;
          +import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.Mapper;
          +import org.apache.hadoop.mapreduce.Reducer;
          +import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
          +import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
          +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
          +import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
          +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
          +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
          +import org.apache.hadoop.util.StringUtils;
          +import org.apache.hadoop.util.Tool;
          +import org.apache.hadoop.util.ToolRunner;
          +
          +import org.apache.nutch.crawl.CrawlDatum;
          +import org.apache.nutch.hostdb.HostDatum;
          +import org.apache.nutch.net.URLFilters;
          +import org.apache.nutch.net.URLNormalizers;
          +import org.apache.nutch.protocol.Content;
          +import org.apache.nutch.protocol.Protocol;
          +import org.apache.nutch.protocol.ProtocolFactory;
          +import org.apache.nutch.protocol.ProtocolOutput;
          +import org.apache.nutch.protocol.ProtocolStatus;
          +
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +
          +import crawlercommons.robots.BaseRobotRules;
          +import crawlercommons.sitemaps.AbstractSiteMap;
          +import crawlercommons.sitemaps.SiteMap;
          +import crawlercommons.sitemaps.SiteMapIndex;
          +import crawlercommons.sitemaps.SiteMapParser;
          +import crawlercommons.sitemaps.SiteMapURL;
          +
          +/**
          + * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and merging
          + * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
          + *
          + * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
          + * <ol>
          + * <li>Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
          + * list of sitemap links and get only those sitemap pages. This suits well for targeted
          + * crawl of specific hosts.</li>
          + * <li>For open web crawl, it is not possible to track each host and get the sitemap links
          + * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
          + * crawls and inject the urls from sitemap to the crawldb.</li>
          + * </ol>
          + *
          + * <p>For more details see:
          + * https://wiki.apache.org/nutch/SitemapFeature </p>
          + */
          +public class SitemapProcessor extends Configured implements Tool {
          + public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
          + public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
          +
          + public static final String CURRENT_NAME = "current";

          Review comment:
          But in [ReadHostDb](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/hostdb/ReadHostDb.java#L182) and [UpdateHostDb](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/hostdb/UpdateHostDb.java#L107) a String literal `"current"` is still used.

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189#discussion_r113689552

          ##########
          File path: src/java/org/apache/nutch/util/SitemapProcessor.java
          ##########
          @@ -0,0 +1,436 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.nutch.util;
          +
          +import java.io.IOException;
          +import java.net.URL;
          +import java.text.SimpleDateFormat;
          +import java.util.Collection;
          +import java.util.LinkedList;
          +import java.util.List;
          +import java.util.Random;
          +
          +import org.apache.hadoop.conf.Configuration;
          +import org.apache.hadoop.conf.Configured;
          +import org.apache.hadoop.fs.FileSystem;
          +import org.apache.hadoop.fs.Path;
          +import org.apache.hadoop.io.Text;
          +import org.apache.hadoop.io.Writable;
          +import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.Mapper;
          +import org.apache.hadoop.mapreduce.Reducer;
          +import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
          +import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
          +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
          +import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
          +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
          +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
          +import org.apache.hadoop.util.StringUtils;
          +import org.apache.hadoop.util.Tool;
          +import org.apache.hadoop.util.ToolRunner;
          +
          +import org.apache.nutch.crawl.CrawlDatum;
          +import org.apache.nutch.hostdb.HostDatum;
          +import org.apache.nutch.net.URLFilters;
          +import org.apache.nutch.net.URLNormalizers;
          +import org.apache.nutch.protocol.Content;
          +import org.apache.nutch.protocol.Protocol;
          +import org.apache.nutch.protocol.ProtocolFactory;
          +import org.apache.nutch.protocol.ProtocolOutput;
          +import org.apache.nutch.protocol.ProtocolStatus;
          +
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +
          +import crawlercommons.robots.BaseRobotRules;
          +import crawlercommons.sitemaps.AbstractSiteMap;
          +import crawlercommons.sitemaps.SiteMap;
          +import crawlercommons.sitemaps.SiteMapIndex;
          +import crawlercommons.sitemaps.SiteMapParser;
          +import crawlercommons.sitemaps.SiteMapURL;
          +
          +/**
          + * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and merging
          + * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
          + *
          + * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
          + * <ol>
          + * <li>Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
          + * list of sitemap links and get only those sitemap pages. This suits well for targeted
          + * crawl of specific hosts.</li>
          + * <li>For open web crawl, it is not possible to track each host and get the sitemap links
          + * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
          + * crawls and inject the urls from sitemap to the crawldb.</li>
          + * </ol>
          + *
          + * <p>For more details see:
          + * https://wiki.apache.org/nutch/SitemapFeature </p>
          + */
          +public class SitemapProcessor extends Configured implements Tool {
          + public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
          + public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
          +
          + public static final String CURRENT_NAME = "current";
          + public static final String LOCK_NAME = ".locked";
          + public static final String SITEMAP_STRICT_PARSING = "sitemap.strict.parsing";
          + public static final String SITEMAP_URL_FILTERING = "sitemap.url.filter";
          + public static final String SITEMAP_URL_NORMALIZING = "sitemap.url.normalize";
          +
          + private static class SitemapMapper extends Mapper<Text, Writable, Text, CrawlDatum> {
          + private ProtocolFactory protocolFactory = null;
          + private boolean strict = true;
          + private boolean filter = true;
          + private boolean normalize = true;
          + private URLFilters filters = null;
          + private URLNormalizers normalizers = null;
          + private CrawlDatum datum = new CrawlDatum();
          + private SiteMapParser parser = null;
          +
+ public void setup(Context context) {
+ Configuration conf = context.getConfiguration();
+ this.protocolFactory = new ProtocolFactory(conf);
+ this.filter = conf.getBoolean(SITEMAP_URL_FILTERING, true);
+ this.normalize = conf.getBoolean(SITEMAP_URL_NORMALIZING, true);
+ this.strict = conf.getBoolean(SITEMAP_STRICT_PARSING, true);
+ this.parser = new SiteMapParser(strict);
+
+ if (filter)
+ filters = new URLFilters(conf);
+ if (normalize)
+ normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_DEFAULT);
+ }
          +
          + public void map(Text key, Writable value, Context context) throws IOException, InterruptedException {
          + String url;
          +
          + try {
+ if (value instanceof CrawlDatum) {
+ // If its an entry from CrawlDb, emit it. It will be merged in the reducer
+ context.write(key, (CrawlDatum) value);
+ }
          + else if (value instanceof HostDatum) {
          + // For entry from hostdb, get sitemap url(s) from robots.txt, fetch the sitemap,
          + // extract urls and emit those
          +
          + // try different combinations of schemes one by one till we get rejection in all cases
          + String host = key.toString();
          + if((url = filterNormalize("http://" + host + "/")) == null &&
          + (url = filterNormalize("https://" + host + "/")) == null &&
          + (url = filterNormalize("ftp://" + host + "/")) == null &&
+ (url = filterNormalize("file:/" + host + "/")) == null) {
+ context.getCounter("Sitemap", "filtered_records").increment(1);
+ return;
+ }
          +
          + BaseRobotRules rules = protocolFactory.getProtocol(url).getRobotRules(new Text(url), datum, new LinkedList<>());

          Review comment:
          It's safe to pass null unless you want to use the robots.txt content.
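          To make the suggestion concrete, here is a minimal sketch (assuming, as the comment implies, that the third argument of getRobotRules() is only an optional sink for the raw robots.txt content):
          ```
          // As in the PR: a throwaway list is allocated for every HostDatum record,
          // although the robots.txt content collected into it is never read.
          // BaseRobotRules rules = protocolFactory.getProtocol(url)
          //     .getRobotRules(new Text(url), datum, new LinkedList<>());

          // Per the review comment, passing null should be fine when the robots.txt
          // content itself is not needed:
          BaseRobotRules rules = protocolFactory.getProtocol(url)
              .getRobotRules(new Text(url), datum, null);
          ```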

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189#discussion_r113693977

          ##########
          File path: src/java/org/apache/nutch/util/SitemapProcessor.java
          ##########
          @@ -0,0 +1,436 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.nutch.util;
          +
          +import java.io.IOException;
          +import java.net.URL;
          +import java.text.SimpleDateFormat;
          +import java.util.Collection;
          +import java.util.LinkedList;
          +import java.util.List;
          +import java.util.Random;
          +
          +import org.apache.hadoop.conf.Configuration;
          +import org.apache.hadoop.conf.Configured;
          +import org.apache.hadoop.fs.FileSystem;
          +import org.apache.hadoop.fs.Path;
          +import org.apache.hadoop.io.Text;
          +import org.apache.hadoop.io.Writable;
          +import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.Mapper;
          +import org.apache.hadoop.mapreduce.Reducer;
          +import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
          +import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
          +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
          +import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
          +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
          +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
          +import org.apache.hadoop.util.StringUtils;
          +import org.apache.hadoop.util.Tool;
          +import org.apache.hadoop.util.ToolRunner;
          +
          +import org.apache.nutch.crawl.CrawlDatum;
          +import org.apache.nutch.hostdb.HostDatum;
          +import org.apache.nutch.net.URLFilters;
          +import org.apache.nutch.net.URLNormalizers;
          +import org.apache.nutch.protocol.Content;
          +import org.apache.nutch.protocol.Protocol;
          +import org.apache.nutch.protocol.ProtocolFactory;
          +import org.apache.nutch.protocol.ProtocolOutput;
          +import org.apache.nutch.protocol.ProtocolStatus;
          +
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +
          +import crawlercommons.robots.BaseRobotRules;
          +import crawlercommons.sitemaps.AbstractSiteMap;
          +import crawlercommons.sitemaps.SiteMap;
          +import crawlercommons.sitemaps.SiteMapIndex;
          +import crawlercommons.sitemaps.SiteMapParser;
          +import crawlercommons.sitemaps.SiteMapURL;
          +
          +/**
          + * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and merging
          + * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
          + *
          + * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
          + * <ol>
          + * <li>Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
          + * list of sitemap links and get only those sitemap pages. This suits well for targeted
          + * crawl of specific hosts.</li>
          + * <li>For open web crawl, it is not possible to track each host and get the sitemap links
          + * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
          + * crawls and inject the urls from sitemap to the crawldb.</li>
          + * </ol>
          + *
          + * <p>For more details see:
          + * https://wiki.apache.org/nutch/SitemapFeature </p>
          + */
          +public class SitemapProcessor extends Configured implements Tool {
          + public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
          + public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
          +
          + public static final String CURRENT_NAME = "current";
          + public static final String LOCK_NAME = ".locked";
          + public static final String SITEMAP_STRICT_PARSING = "sitemap.strict.parsing";
          + public static final String SITEMAP_URL_FILTERING = "sitemap.url.filter";
          + public static final String SITEMAP_URL_NORMALIZING = "sitemap.url.normalize";
          +
          + private static class SitemapMapper extends Mapper<Text, Writable, Text, CrawlDatum> {
          + private ProtocolFactory protocolFactory = null;
          + private boolean strict = true;
          + private boolean filter = true;
          + private boolean normalize = true;
          + private URLFilters filters = null;
          + private URLNormalizers normalizers = null;
          + private CrawlDatum datum = new CrawlDatum();
          + private SiteMapParser parser = null;
          +
+ public void setup(Context context) {
+ Configuration conf = context.getConfiguration();
+ this.protocolFactory = new ProtocolFactory(conf);
+ this.filter = conf.getBoolean(SITEMAP_URL_FILTERING, true);
+ this.normalize = conf.getBoolean(SITEMAP_URL_NORMALIZING, true);
+ this.strict = conf.getBoolean(SITEMAP_STRICT_PARSING, true);
+ this.parser = new SiteMapParser(strict);
+
+ if (filter)
+ filters = new URLFilters(conf);
+ if (normalize)
+ normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_DEFAULT);
+ }
          +
          + public void map(Text key, Writable value, Context context) throws IOException, InterruptedException {
          + String url;
          +
          + try {
+ if (value instanceof CrawlDatum) {
+ // If its an entry from CrawlDb, emit it. It will be merged in the reducer
+ context.write(key, (CrawlDatum) value);
+ }
          + else if (value instanceof HostDatum) {
          + // For entry from hostdb, get sitemap url(s) from robots.txt, fetch the sitemap,
          + // extract urls and emit those
          +
          + // try different combinations of schemes one by one till we get rejection in all cases
          + String host = key.toString();
          + if((url = filterNormalize("http://" + host + "/")) == null &&
          + (url = filterNormalize("https://" + host + "/")) == null &&
          + (url = filterNormalize("ftp://" + host + "/")) == null &&
+ (url = filterNormalize("file:/" + host + "/")) == null) {
+ context.getCounter("Sitemap", "filtered_records").increment(1);
+ return;
+ }
          +
          + BaseRobotRules rules = protocolFactory.getProtocol(url).getRobotRules(new Text(url), datum, new LinkedList<>());
          + List<String> sitemaps = rules.getSitemaps();
+ for(String sitemap: sitemaps) {
+ context.getCounter("Sitemap", "sitemaps_from_hostdb").increment(1);
+ generateSitemapUrlDatum(protocolFactory.getProtocol(sitemap), sitemap, context);
+ }
          + }
          + else if (value instanceof Text) {
          + // For entry from sitemap urls file, fetch the sitemap, extract urls and emit those
+ if((url = filterNormalize(key.toString())) == null) {
+ context.getCounter("Sitemap", "filtered_records").increment(1);
+ return;
+ }
          +
          + context.getCounter("Sitemap", "sitemap_seeds").increment(1);
          + generateSitemapUrlDatum(protocolFactory.getProtocol(url), url, context);
          + }
+ } catch (Exception e) {
+ LOG.warn("Exception for record " + key.toString() + " : " + StringUtils.stringifyException(e));
+ }
          + }
          +
          + /* Filters and or normalizes the input URL */
          + private String filterNormalize(String url) {
+ try {
+ if (normalizers != null)
+ url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT);
+
+ if (filters != null)
+ url = filters.filter(url);
+ } catch (Exception e) {
+ return null;
+ }
          + return url;
          + }
          +
          + private void generateSitemapUrlDatum(Protocol protocol, String url, Context context) throws Exception {
          + ProtocolOutput output = protocol.getProtocolOutput(new Text(url), datum);
          + ProtocolStatus status = output.getStatus();
          + Content content = output.getContent();
          +
+ if(status.getCode() != ProtocolStatus.SUCCESS) {
+ // If there were any problems fetching the sitemap, log the error and let it go. Not sure how often
+ // sitemaps are redirected. In future we might have to handle redirects.
+ context.getCounter("Sitemap", "failed_fetches").increment(1);
+ LOG.error("Error while fetching the sitemap. Status code: " + status.getCode() + " for " + url);
+ return;
+ }
          +
          + AbstractSiteMap asm = parser.parseSiteMap(content.getContentType(), content.getContent(), new URL(url));
          + if(asm instanceof SiteMap) {
          + SiteMap sm = (SiteMap) asm;
          + Collection<SiteMapURL> sitemapUrls = sm.getSiteMapUrls();
          +
          + for(SiteMapURL sitemapUrl: sitemapUrls) {
          + // If 'strict' is ON, only allow valid urls. Else allow all urls
          + if(!strict || sitemapUrl.isValid()) {
          + String key = filterNormalize(sitemapUrl.getUrl().toString());
          + if (key != null) {
          + CrawlDatum sitemapUrlDatum = new CrawlDatum();
          + sitemapUrlDatum.setStatus(CrawlDatum.STATUS_SITEMAP);
          + sitemapUrlDatum.setScore((float) sitemapUrl.getPriority());
          +
          + if(sitemapUrl.getChangeFrequency() != null) {
          + int fetchInterval = -1;
+ switch(sitemapUrl.getChangeFrequency()) {
+ case ALWAYS: fetchInterval = 1; break;
+ case HOURLY: fetchInterval = 3600; break; // 60*60
+ case DAILY: fetchInterval = 86400; break; // 60*60*24
+ case WEEKLY: fetchInterval = 604800; break; // 60*60*24*7
+ case MONTHLY: fetchInterval = 2592000; break; // 60*60*24*30
+ case YEARLY: fetchInterval = 31536000; break; // 60*60*24*365
+ case NEVER: fetchInterval = Integer.MAX_VALUE; break; // Loose "NEVER" contract
+ }
          + sitemapUrlDatum.setFetchInterval(fetchInterval);
          + }
          +
          + if(sitemapUrl.getLastModified() != null)
          + sitemapUrlDatum.setModifiedTime(sitemapUrl.getLastModified().getTime());
          +
          + context.write(new Text(key), sitemapUrlDatum);
          + }
          + }
          + }
          + }
          + else if (asm instanceof SiteMapIndex) {
          + SiteMapIndex index = (SiteMapIndex) asm;
          + Collection<AbstractSiteMap> sitemapUrls = index.getSitemaps();
          +
          + for(AbstractSiteMap sitemap: sitemapUrls) {
+ if(sitemap.isIndex()) {
+ generateSitemapUrlDatum(protocol, sitemap.getUrl().toString(), context);
+ }
          + }
          + }
          + }
          + }
          +
          + private static class SitemapReducer extends Reducer<Text, CrawlDatum, Text, CrawlDatum> {
          + CrawlDatum sitemapDatum = null;
          + CrawlDatum originalDatum = null;
          +
          + public void reduce(Text key, Iterable<CrawlDatum> values, Context context)
          + throws IOException, InterruptedException {
          + sitemapDatum = null;
          + originalDatum = null;
          +
          + for (CrawlDatum curr: values) {
+ if(curr.getStatus() == CrawlDatum.STATUS_SITEMAP && sitemapDatum == null) {
+ sitemapDatum = new CrawlDatum();
+ sitemapDatum.set(curr);
+ }
+ else {
+ originalDatum = new CrawlDatum();
+ originalDatum.set(curr);
+ }
          + }
          +
          + if(originalDatum != null) {
          + // The url was already present in crawldb. If we got the same url from sitemap too, save
          + // the information from sitemap to the original datum. Emit the original crawl datum
          + if(sitemapDatum != null) {
          + context.getCounter("Sitemap", "existing_sitemap_entries").increment(1);
          + originalDatum.setScore(sitemapDatum.getScore());

          Review comment:

          The [sitemap spec](https://www.sitemaps.org/protocol.html#xmlTagDefinitions) defines the priority as "the priority of this URL relative to other URLs on your site". That's different from a global score as calculated by OPIC, page rank, etc. Overwriting the fetchInterval will make [calculateLastFetchTime()](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java#L158) return the wrong time once a page has been fetched, and overwriting the modified time breaks any if-modified-since handling. See the discussion in NUTCH-1465 (https://issues.apache.org/jira/browse/NUTCH-1465) on Jan 30-31, 2014. I was also wrong about how to map these "concepts" from sitemaps to Nutch-internal ones.
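          As a purely illustrative sketch (not what the current patch does), a more conservative merge in SitemapReducer could keep the existing CrawlDb values and record the sitemap hint as metadata; the metadata key below is hypothetical:
          ```
          // Sketch: leave score, fetchInterval and modifiedTime of the existing entry
          // untouched and stash the sitemap priority as metadata, so a scoring or
          // fetch-schedule plugin can decide whether to use it.
          if (originalDatum != null) {
            if (sitemapDatum != null) {
              context.getCounter("Sitemap", "existing_sitemap_entries").increment(1);
              originalDatum.getMetaData().put(new Text("sitemap.priority"),
                  new org.apache.hadoop.io.FloatWritable(sitemapDatum.getScore()));
            }
            context.write(key, originalDatum);
          }
          ```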

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189#discussion_r113687939

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDatum.java
          ##########
          @@ -90,6 +90,8 @@
          public static final byte STATUS_LINKED = 0x43;
          /** Page got metadata from a parser */
          public static final byte STATUS_PARSE_META = 0x44;
          + /** Page was discovered from sitemap */
          + public static final byte STATUS_SITEMAP = 0x45;

          Review comment:
          Do we really need a new status? STATUS_INJECTED could be also used: both are assigned in the mapper (SitemapMapper resp. InjectMapper) and replaced by STATUS_DB_UNFETCHED in the reducer (SitemapReducer/InjectReducer).
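          A rough sketch of that reuse, mirroring what Injector does (variable names are placeholders; this is not the patch as currently posted):
          ```
          // SitemapMapper: mark sitemap-discovered URLs with the existing injected status.
          CrawlDatum sitemapUrlDatum = new CrawlDatum();
          sitemapUrlDatum.setStatus(CrawlDatum.STATUS_INJECTED);

          // SitemapReducer: convert to the regular unfetched status before writing to
          // the CrawlDb, the same way the inject reducer does ("datum" is the merged entry).
          if (datum.getStatus() == CrawlDatum.STATUS_INJECTED) {
            datum.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
          }
          ```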

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189#discussion_r113684593

          ##########
          File path: conf/nutch-default.xml
          ##########
          @@ -2529,7 +2529,33 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
          <value></value>
          <description>
          Default is 'fanout.key'

          - The routingKey used by publisher to publish messages to specific queues. If the exchange type is "fanout", then this property is ignored.
          + The routingKey used by publisher to publish messages to specific queues.
          + If the exchange type is "fanout", then this property is ignored.
          + </description>
          +</property>
          +
          +<property>

          Review comment:
          These 3 properties are used to transfer command-line options from Hadoop client to tasks. The values are always overwritten, it doesn't make sense to set them here or in nutch-site.xml.
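          For context, a sketch of the pattern described here, assuming the three properties are the sitemap.* keys defined in SitemapProcessor (the boolean locals stand for parsed command-line options):
          ```
          // The job driver copies command-line options into the job Configuration, so
          // any value set in nutch-default.xml or nutch-site.xml is overwritten before
          // the mapper's setup() reads it back.
          Configuration conf = getConf();
          conf.setBoolean(SitemapProcessor.SITEMAP_STRICT_PARSING, strict);
          conf.setBoolean(SitemapProcessor.SITEMAP_URL_FILTERING, filter);
          conf.setBoolean(SitemapProcessor.SITEMAP_URL_NORMALIZING, normalize);
          Job job = Job.getInstance(conf, "SitemapProcessor");
          ```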

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Hide
          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189#discussion_r117406027

          ##########
          File path: src/java/org/apache/nutch/util/SitemapProcessor.java
          ##########
          @@ -0,0 +1,436 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.nutch.util;
          +
          +import java.io.IOException;
          +import java.net.URL;
          +import java.text.SimpleDateFormat;
          +import java.util.Collection;
          +import java.util.LinkedList;
          +import java.util.List;
          +import java.util.Random;
          +
          +import org.apache.hadoop.conf.Configuration;
          +import org.apache.hadoop.conf.Configured;
          +import org.apache.hadoop.fs.FileSystem;
          +import org.apache.hadoop.fs.Path;
          +import org.apache.hadoop.io.Text;
          +import org.apache.hadoop.io.Writable;
          +import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.Mapper;
          +import org.apache.hadoop.mapreduce.Reducer;
          +import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
          +import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
          +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
          +import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
          +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
          +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
          +import org.apache.hadoop.util.StringUtils;
          +import org.apache.hadoop.util.Tool;
          +import org.apache.hadoop.util.ToolRunner;
          +
          +import org.apache.nutch.crawl.CrawlDatum;
          +import org.apache.nutch.hostdb.HostDatum;
          +import org.apache.nutch.net.URLFilters;
          +import org.apache.nutch.net.URLNormalizers;
          +import org.apache.nutch.protocol.Content;
          +import org.apache.nutch.protocol.Protocol;
          +import org.apache.nutch.protocol.ProtocolFactory;
          +import org.apache.nutch.protocol.ProtocolOutput;
          +import org.apache.nutch.protocol.ProtocolStatus;
          +
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +
          +import crawlercommons.robots.BaseRobotRules;
          +import crawlercommons.sitemaps.AbstractSiteMap;
          +import crawlercommons.sitemaps.SiteMap;
          +import crawlercommons.sitemaps.SiteMapIndex;
          +import crawlercommons.sitemaps.SiteMapParser;
          +import crawlercommons.sitemaps.SiteMapURL;
          +
          +/**
          + * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and merging
          + * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
          + *
          + * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
          + * <ol>
          + * <li>Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
          + * list of sitemap links and get only those sitemap pages. This suits well for targeted
          + * crawl of specific hosts.</li>
          + * <li>For open web crawl, it is not possible to track each host and get the sitemap links
          + * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
          + * crawls and inject the urls from sitemap to the crawldb.</li>
          + * </ol>
          + *
          + * <p>For more details see:
          + * https://wiki.apache.org/nutch/SitemapFeature </p>
          + */
          +public class SitemapProcessor extends Configured implements Tool {
          + public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
          + public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
          +
          + public static final String CURRENT_NAME = "current";

          Review comment:
          What is your suggestion here @sebastian-nagel ?

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Hide
          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/189#issuecomment-302617703

          @sebastian-nagel I've addressed all but two of your comments and responded. I've also implemented parameterized logging. In addition, I've dropped STATUS_SITEMAP, replacing instances with STATUS_INJECTED.
          N.B. When I run this as follows, I am currently not able to inject any URLs into the CrawlDb:
          ```
          //First I inject a random URL to create a CrawlDB

          lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch inject crawl urls/
          Injector: starting at 2017-05-18 23:01:14
          Injector: crawlDb: crawl
          Injector: urlDir: urls
          Injector: Converting injected urls to crawl db entries.
          Injector: overwrite: false
          Injector: update: false
          Injector: Total urls rejected by filters: 0
          Injector: Total urls injected after normalization and filtering: 1
          Injector: Total urls injected but already in CrawlDb: 0
          Injector: Total new urls injected: 1
          Injector: finished at 2017-05-18 23:01:15, elapsed: 00:00:01

          // I then, attempt to process a sitemap at http://www.autotrader.com/sitemap.xml which I've added to a file located in a 'sitemaps' directory

          lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch sitemap crawl -sitemapUrls sitemaps
          SitemapProcessor: sitemap urls dir: sitemaps
          SitemapProcessor: Starting at 2017-05-18 23:06:38
          robots.txt whitelist not configured.
          SitemapProcessor: Total records rejected by filters: 0
          SitemapProcessor: Total sitemaps from HostDb: 0
          SitemapProcessor: Total sitemaps from seed urls: 1
          SitemapProcessor: Total failed sitemap fetches: 0
          SitemapProcessor: Total new sitemap entries added: 0
          SitemapProcessor: Finished at 2017-05-18 23:06:48, elapsed: 00:00:10

          // Lets read the DB

          lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch readdb crawl -stats
          CrawlDb statistics start: crawl
          Statistics for CrawlDb: crawl
          TOTAL urls: 1
          shortest fetch interval: 30 days, 00:00:00
          avg fetch interval: 30 days, 00:00:00
          longest fetch interval: 30 days, 00:00:00
          earliest fetch time: Thu May 18 23:01:00 PDT 2017
          avg of fetch times: Thu May 18 23:01:00 PDT 2017
          latest fetch time: Thu May 18 23:01:00 PDT 2017
          retry 0: 1
          min score: 1.0
          avg score: 1.0
          max score: 1.0
          status 1 (db_unfetched): 1
          CrawlDb statistics: done
          ```
          As you can see, no URLs appear to have been processed: the number of new sitemap entries added is zero, which the readdb output confirms.
          I need to do some more debugging to find where the bug(s) are. If anyone is able to try this patch out and has an interest in sitemap support in Nutch master, it would be highly appreciated.

          markus17 Markus Jelsma added a comment -

          Updated patch for trunk:

          • added some curly braces to if statements; that kind of formatting always trips me up at some point;
          • added support for redirects: in hostdb mode a URL is built for URL filtering, but the actual protocol can be https instead, so the redirect is followed;
          • added support for defaulting to /sitemap.xml, since some robots.txt files do not properly point to the sitemap (see the discovery sketch below);
          • added support for NOT overwriting existing CrawlDatum information and made it the default option; letting an external sitemap overwrite the fetch interval is a very bad idea.
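
          For readers following along, here is a minimal, standalone sketch of the /sitemap.xml fallback described in the third bullet above. The class and method names are hypothetical, not the patch's actual code; only crawler-commons' BaseRobotRules.getSitemaps() is taken from the real API.

          ```java
          import java.util.LinkedList;
          import java.util.List;

          import crawlercommons.robots.BaseRobotRules;

          /** Hypothetical helper: decide which sitemap URLs to fetch for one host. */
          class SitemapDiscovery {
            static List<String> sitemapUrlsFor(String hostUrl, BaseRobotRules robotRules) {
              List<String> sitemaps = new LinkedList<>();
              if (robotRules != null && !robotRules.getSitemaps().isEmpty()) {
                // robots.txt explicitly lists one or more sitemap locations
                sitemaps.addAll(robotRules.getSitemaps());
              } else {
                // fall back to the conventional /sitemap.xml when robots.txt is silent or broken
                sitemaps.add(hostUrl.replaceAll("/+$", "") + "/sitemap.xml");
              }
              return sitemaps;
            }
          }
          ```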
          markus17 Markus Jelsma added a comment -

          Updated patch:

          • corrected the implementation for not overwriting existing entries;
          • the CrawlDb is now emitted via MapOutputFormat instead of SequenceFileOutputFormat (a job-setup sketch follows below).
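
          A hedged sketch of what that job wiring could look like with the new MapReduce API. MapFileOutputFormat is an assumption on my part, based on the MapOutputFormat mentioned above and on the CrawlDb being stored as a MapFile elsewhere in Nutch; the class and job names are illustrative, not the patch's actual code.

          ```java
          import java.io.IOException;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;

          import org.apache.nutch.crawl.CrawlDatum;

          class SitemapJobSetup {
            /** Illustrative job setup: write the merged CrawlDb as a MapFile, not a plain SequenceFile. */
            static Job newSitemapJob(Configuration conf) throws IOException {
              Job job = Job.getInstance(conf, "SitemapProcessor");
              job.setOutputFormatClass(MapFileOutputFormat.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(CrawlDatum.class);
              return job;
            }
          }
          ```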
          markus17 Markus Jelsma added a comment -

          There is an oddity going on when a sitemap.xml entry is listed twice. It then assumes the db_status INJECTED and overwrites existing CrawlDatum completely.

          markus17 Markus Jelsma added a comment - - edited

          Ah, removing the NULL check in the reducer solves the problem. The existing entries are no longer overwritten. This was visible with readdb -stats, which showed a number of records with status INJECTED.
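
          To make the NULL-check discussion concrete, here is an illustrative reducer showing the intended merge rule: an entry already present in the CrawlDb always wins over one discovered via a sitemap, so a URL listed twice in sitemap.xml can never replace existing CrawlDatum state. This is a sketch of the behaviour being described, not the patch's actual reducer.

          ```java
          import java.io.IOException;

          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Reducer;

          import org.apache.nutch.crawl.CrawlDatum;

          /** Illustrative merge rule: prefer the existing CrawlDb entry over sitemap-injected ones. */
          class SitemapMergeReducer extends Reducer<Text, CrawlDatum, Text, CrawlDatum> {
            @Override
            protected void reduce(Text key, Iterable<CrawlDatum> values, Context context)
                throws IOException, InterruptedException {
              CrawlDatum existing = null;
              CrawlDatum fromSitemap = null;
              for (CrawlDatum datum : values) {
                if (datum.getStatus() == CrawlDatum.STATUS_INJECTED) {
                  if (fromSitemap == null) fromSitemap = new CrawlDatum();
                  fromSitemap.set(datum); // copy: Hadoop reuses the value instance between iterations
                } else {
                  if (existing == null) existing = new CrawlDatum();
                  existing.set(datum);
                }
              }
              // Only fall back to the sitemap-discovered datum when the URL is genuinely new.
              CrawlDatum out = (existing != null) ? existing : fromSitemap;
              if (out != null) {
                context.write(key, out);
              }
            }
          }
          ```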

          lewismc Lewis John McGibbney added a comment -

          Fantastic Markus Jelsma, is this working well for you? I am going to try this out. Out of curiosity, is this based off the GitHub PR or the various patches associated with this issue? I ask because I've seen quite a lot of variability in the implementations.

          markus17 Markus Jelsma added a comment -

          Hi Lewis!

          It appears to be working fine now, and bug-free, because the input no longer overwrites the fetch interval and modified time of existing CrawlDb entries:

          • that is messy in Nutch
          • websites almost always set bad values, e.g. 100k-page websites signaling to refetch everything daily.

          We have it deployed but not yet activated; that's the plan for early next week.

          The patch is based on the latest comments in this thread and the most recent scraps I found on GitHub. It should include the most recent contributions you all added.

          githubbot ASF GitHub Bot added a comment -

          lewismc opened a new pull request #195: NUTCH-1465 Support sitemaps in Nutch
          URL: https://github.com/apache/nutch/pull/195

          Hi folks, this PR is a mirror of Markus' latest patch over on https://issues.apache.org/jira/browse/NUTCH-1465; it exists merely for improved review.

          lewismc Lewis John McGibbney added a comment -

          Hi Markus Jelsma, I went ahead and generated a PR for others to review over at https://github.com/apache/nutch/pull/195

          lewismc Lewis John McGibbney added a comment -

          Markus Jelsma, when attempting to process the following sitemap - http://www.autotrader.com/sitemap.xml - it appears the new processor is not able to process anything: although the CrawlDb data structures are produced, no entries are added. Can you please rescope the patch and ensure it is the most up-to-date one you are working with? Thanks

          ```
          2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total records rejected by filters: 0
          2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total sitemaps from HostDb: 0
          2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total sitemaps from seed urls: 1
          2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total failed sitemap fetches: 0
          2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total new sitemap entries added: 0
          2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Finished at 2017-07-03 15:32:09, elapsed: 00:00:19
          ```
          markus17 Markus Jelsma added a comment -

          Hello Lewis, I am positive I took the latest pieces. And checking the GH page, that problem wasn't solved in the first place, right? Or am I missing something? https://github.com/apache/nutch/pull/189#discussion_r113578491

          markus17 Markus Jelsma added a comment -

          Ah, I see. The autotrader sitemap points to an index of sitemaps. Everything is fine except that it does not pass if(sitemap.isIndex()). When printing its getType() I get null. So something is wrong with either the sitemap index, crawler-commons, or my code.
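
          For anyone debugging this, here is a small standalone sketch of how crawler-commons 0.8 distinguishes a sitemap index from a plain sitemap; fetching the raw bytes is assumed to happen elsewhere, and the class and method names of the wrapper are illustrative, not the patch's code.

          ```java
          import java.io.IOException;
          import java.net.URL;
          import java.util.Collection;

          import crawlercommons.sitemaps.AbstractSiteMap;
          import crawlercommons.sitemaps.SiteMap;
          import crawlercommons.sitemaps.SiteMapIndex;
          import crawlercommons.sitemaps.SiteMapParser;
          import crawlercommons.sitemaps.SiteMapURL;
          import crawlercommons.sitemaps.UnknownFormatException;

          class SitemapIndexCheck {
            /** Parse raw sitemap bytes and report whether an index or a plain sitemap came back. */
            static void inspect(String contentType, byte[] content, URL url)
                throws UnknownFormatException, IOException {
              SiteMapParser parser = new SiteMapParser(false); // false = lenient (non-strict) parsing
              AbstractSiteMap sm = parser.parseSiteMap(contentType, content, url);
              if (sm.isIndex()) {
                // A sitemap index: each child sitemap must be fetched and parsed in turn.
                Collection<AbstractSiteMap> children = ((SiteMapIndex) sm).getSitemaps();
                System.out.println("index, type=" + sm.getType() + ", children=" + children.size());
              } else {
                Collection<SiteMapURL> urls = ((SiteMap) sm).getSiteMapUrls();
                System.out.println("sitemap, type=" + sm.getType() + ", urls=" + urls.size());
              }
            }
          }
          ```

          If getType() comes back null for the autotrader index, the Content-Type passed into parseSiteMap() may be worth checking, since format detection depends on it.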

          lewismc Lewis John McGibbney added a comment -

          Markus Jelsma, can we also update the version of crawler-commons to 0.8, which is the latest version available in Maven Central? I'll take a look at the processing logic once the update has been made. Thanks Markus.

          markus17 Markus Jelsma added a comment -

          Hi Lewis, 0.8 doesn't deal with this sitemap at autotrader either.

          markus17 Markus Jelsma added a comment -

          Anyway, here's the patch with crawler-commons 0.8.

          markus17 Markus Jelsma added a comment -

          I think this is committable; does anyone disagree? If not, I'll get this in early next week.

          wastl-nagel Sebastian Nagel added a comment -

          Thanks, Markus Jelsma! Tested on a small set of sitemaps. Looks good to me; I've only improved the descriptions of the properties and did some code clean-up (patch / pull request to follow). Please go ahead and commit it! We can later improve it to make it more robust or to make more sophisticated use of the last-modified times and priorities provided in sitemaps. Thanks!

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel opened a new pull request #202: NUTCH-1465 Support for sitemaps
          URL: https://github.com/apache/nutch/pull/202

          (applied Markus' patch as of 2017-07-05)

          • add SitemapProcessor
          • upgrade dependency crawler-commons to 0.8

          markus17 Markus Jelsma added a comment -

          Thanks! Will grab 202.patch and see if it fits tomorrow!

          markus17 Markus Jelsma added a comment -

          Sebastian, your patch also touches CrawlDatum and IndexingFilterChecker, just for the newline at the tail. No problem, but I do miss your updated description of the properties. I cannot find it in https://github.com/apache/nutch/pull/202.patch

          wastl-nagel Sebastian Nagel added a comment -

          I've modified the descriptions of the properties sitemap.strict.parsing and sitemap.url.overwrite.existing. But feel free to add your modifications/additions. I just tried to make them understandable for anyone who does not know the gory details.
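
          On the Java side these two properties would typically be read from the Hadoop Configuration roughly as below; the default values shown are assumptions, and the authoritative ones live in conf/nutch-default.xml.

          ```java
          import org.apache.hadoop.conf.Configuration;

          class SitemapConfig {
            /** Whether the crawler-commons parser runs in strict mode (assumed default: true). */
            static boolean strictParsing(Configuration conf) {
              return conf.getBoolean("sitemap.strict.parsing", true);
            }

            /** Whether sitemap entries may overwrite existing CrawlDb records (assumed default: false). */
            static boolean overwriteExisting(Configuration conf) {
              return conf.getBoolean("sitemap.url.overwrite.existing", false);
            }
          }
          ```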

          markus17 Markus Jelsma added a comment -

          Crap! I was probably looking without seeing! Got it!

          markus17 Markus Jelsma added a comment -

          remote: 2dc7472..8f556f4 8f556f4a87d87edb96fb575fa4b579e39d9dfdb4 -> master

          Thanks Tejas, Sebastian, Lewis, Ken!

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Nutch-trunk #3435 (See https://builds.apache.org/job/Nutch-trunk/3435/)
          NUTCH-1465 (markus: https://github.com/apache/nutch/commit/b58d6cd9111b2d25b8f6f009015ac214bac4006d)

          • (edit) conf/log4j.properties
          • (add) src/java/org/apache/nutch/util/SitemapProcessor.java
          • (edit) ivy/ivy.xml
          • (edit) conf/nutch-default.xml
          • (edit) src/bin/nutch

            People

            • Assignee: markus17 Markus Jelsma
            • Reporter: lewismc Lewis John McGibbney
            • Votes: 1
            • Watchers: 11