NUTCH-1325: HostDB for Nutch

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12
    • Component/s: hostdb
    • Labels: None
    • Patch Info: Patch Available

    Description

      HostDB for Apache Nutch 1.x

      • automatically generates a HostDB based on CrawlDB information
      • periodically performs DNS lookup for all hosts and keeps track of DNS failures
      • discovers homepage if www.example.org/ is a redirect
      • keeps track of host statistics such as the number of URLs, 404s, not-modified responses and redirects
      • aggregates CrawlDB metadata fields into totals, sums, min, max, average and configurable percentiles
      • can output lists of discovered homepage URLs for seed lists and static fetch interval
      • can output blacklists for hosts that have too many DNS failures, so they can be filtered from the CrawlDB using domainblacklist-urlfilter
      • supports JEXL expressions, just like the CrawlDB
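
To make the aggregation feature concrete, here is a minimal Python sketch of per-host aggregation over a numeric field such as a response time. It is not the code from the patch: the function names, the `pct50`/`pct95` keys and the nearest-rank percentile method are assumptions chosen for illustration, and the sample values are made up.

```python
import math
from collections import defaultdict
from urllib.parse import urlsplit


def percentile(sorted_values, p):
    """Nearest-rank percentile (an assumption; the patch may use another method)."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def aggregate_by_host(records, percentiles=(50, 95)):
    """Group (url, numeric value) pairs by host and summarise each group."""
    per_host = defaultdict(list)
    for url, value in records:
        per_host[urlsplit(url).hostname].append(value)

    stats = {}
    for host, values in per_host.items():
        values.sort()
        summary = {
            "sum": sum(values),
            "min": values[0],
            "max": values[-1],
            "avg": sum(values) / len(values),
        }
        # Hypothetical key names for the configurable percentiles.
        for p in percentiles:
            summary[f"pct{p}"] = percentile(values, p)
        stats[host] = summary
    return stats


# Made-up sample data: (URL, response time in ms).
sample = [
    ("http://www.example.org/", 40),
    ("http://www.example.org/a", 60),
    ("http://www.example.org/b", 80),
    ("http://other.example.com/", 10),
]
stats = aggregate_by_host(sample)
```

Each host ends up with one summary record, which is the shape an expression such as `avg._rs_ > 50` would then be evaluated against.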

      Examples

      Generate for the first time, or update an existing HostDB:

      bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb
      

      Optional filtering or normalizing:

      bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb -filter -normalize
      

      Dumping as CSV file:

      bin/nutch readhostdb crawl/hostdb output_directory
      

      Get only hostnames that have an average response time above 50 ms:

      bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(avg._rs_ > 50)"
      

      Get only hosts for which more than 50% of records are 404s:

      bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(gone / numRecords > 0.5)"
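
The 404 filter above can be read as a predicate applied to each host's counters. The Python sketch below mirrors that logic on made-up data. The host names and counts are hypothetical, the zero-record guard is an addition of this sketch, and Python's `/` is floating-point division, which may differ from how the expression engine treats integer operands.

```python
# Hypothetical per-host counters; the field names follow this issue.
hosts = {
    "www.example.org": {"gone": 60, "numRecords": 100},
    "shop.example.com": {"gone": 5, "numRecords": 200},
    "empty.example.net": {"gone": 0, "numRecords": 0},
}


def mostly_gone(counters, threshold=0.5):
    """Python equivalent of the expression (gone / numRecords > 0.5)."""
    if counters["numRecords"] == 0:
        # Guard added for this sketch: a host without records never matches.
        return False
    return counters["gone"] / counters["numRecords"] > threshold


selected = sorted(host for host, c in hosts.items() if mostly_gone(c))
print(selected)
```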
      

      In JEXL expressions, all host metadata fields are available. The following fields are available as well:

      unfetched – number of unfetched records
      fetched – number of fetched records
      gone – number of 404s
      redirTemp – number of temporary redirects
      redirPerm – number of permanent redirects
      redirs – total number of redirects (redirTemp + redirPerm)
      notModified – number of not modified records
      ok – number of usable pages (fetched + notModified)
      numRecords – total number of records
      dnsFailures – number of DNS failures
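
Two of the fields listed above are derived from others: `redirs` is `redirTemp + redirPerm` and `ok` is `fetched + notModified`. The following Python sketch models only those stated relationships. The class `HostCounters` and its `context()` method are hypothetical names for illustration and do not correspond to classes in the patch.

```python
from dataclasses import asdict, dataclass


@dataclass
class HostCounters:
    """Hypothetical container for the per-host fields listed in this issue."""
    unfetched: int = 0
    fetched: int = 0
    gone: int = 0
    redirTemp: int = 0
    redirPerm: int = 0
    notModified: int = 0
    numRecords: int = 0
    dnsFailures: int = 0

    @property
    def redirs(self):
        # Total number of redirects, as defined in the field list.
        return self.redirTemp + self.redirPerm

    @property
    def ok(self):
        # Usable pages, as defined in the field list.
        return self.fetched + self.notModified

    def context(self):
        """Return stored and derived fields as one name-to-value mapping."""
        values = asdict(self)
        values["redirs"] = self.redirs
        values["ok"] = self.ok
        return values


# Made-up counts for a single host.
counters = HostCounters(fetched=70, notModified=10, redirTemp=3,
                        redirPerm=4, gone=13, numRecords=100)
ctx = counters.context()
```

A mapping like `ctx` is the kind of variable set an expression evaluator would need in order to resolve names such as `gone` or `numRecords`.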

      See also nutch-default.xml for the hostdb.* properties.

      Attachments

        1. oi-hostdb.patch
          53 kB
          Markus Jelsma
        2. oi-hostdb.patch
          57 kB
          Markus Jelsma
        3. oi-hostdb.patch
          56 kB
          Markus Jelsma
        4. NUTCH-1325-v4-v5.patch
          6 kB
          Gui Forget
        5. NUTCH-1325-trunk-v5.patch
          45 kB
          Gui Forget
        6. NUTCH-1325-trunk-v4.patch
          45 kB
          Tejas Patil
        7. NUTCH-1325-trunk-v3.patch
          44 kB
          Tejas Patil
        8. NUTCH-1325-removed-from-1.8.patch
          44 kB
          Markus Jelsma
        9. NUTCH-1325-1.6-1.patch
          43 kB
          Markus Jelsma
        10. NUTCH-1325.trunk.v2.path
          44 kB
          Tejas Patil
        11. NUTCH-1325.patch
          63 kB
          Markus Jelsma
        12. NUTCH-1325.patch
          63 kB
          Markus Jelsma
        13. NUTCH-1325.patch
          63 kB
          Markus Jelsma
        14. NUTCH-1325.patch
          64 kB
          Markus Jelsma

            People

              Assignee: Markus Jelsma (markus17)
              Reporter: Markus Jelsma (markus17)
              Votes: 2
              Watchers: 9