Nutch / NUTCH-1325

HostDB for Nutch



    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12
    • Component/s: hostdb
    • Labels:
    • Patch Info:
      Patch Available


      HostDB for Apache Nutch 1.x

      • automatically generates a HostDB based on CrawlDB information
      • periodically performs DNS lookup for all hosts and keeps track of DNS failures
      • discovers homepage if www.example.org/ is a redirect
      • keeps track of host statistics such as the number of URLs, 404s, not-modified responses and redirects
      • aggregates CrawlDB metadata fields into totals, sums, min, max, average and configurable percentiles
      • can output lists of discovered homepage URLs for seed lists and static fetch interval
      • can output blacklists of hosts that have too many DNS failures, so they can be filtered from the CrawlDB using domainblacklist-urlfilter
      • supports JEXL expressions, just like CrawlDB


      Generate for the first time, or update an existing HostDB:

      bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb

      Optional filtering or normalizing:

      bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb -filter -normalize

      Dumping as CSV file:

      bin/nutch readhostdb crawl/hostdb output_directory

      Get only hostnames that have an average response time above 50 ms:

      bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(avg._rs_ > 50)"

      Get only hosts where more than 50% of the records are 404s:

      bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(gone / numRecords > 0.5)"
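      The update and dump steps above can be chained into a small refresh script. The following is a minimal sketch, not part of the patch: NUTCH_BIN, refresh_hostdb and dump_matching_hosts are hypothetical names introduced here, and only the commands and flags already shown in this issue are used.

```shell
#!/bin/sh
# Sketch only: chains the updatehostdb and readhostdb commands from this issue.
# NUTCH_BIN and both function names are hypothetical helpers, not part of Nutch.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

# Create the HostDB, or update an existing one, from the CrawlDB.
refresh_hostdb() {
  hostdb="$1"; crawldb="$2"
  "$NUTCH_BIN" updatehostdb -hostdb "$hostdb" -crawldb "$crawldb"
}

# Dump the hostnames whose host records match a JEXL expression.
dump_matching_hosts() {
  hostdb="$1"; outdir="$2"; expr="$3"
  "$NUTCH_BIN" readhostdb "$hostdb" "$outdir" -dumpHostnames -expr "$expr"
}

# Example, assuming the script is sourced from the Nutch runtime directory:
#   refresh_hostdb crawl/hostdb crawl/crawldb &&
#     dump_matching_hosts crawl/hostdb output_directory "(gone / numRecords > 0.5)"
```

      Passing the expression as a single quoted argument keeps the shell from interpreting the parentheses and the > character.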

      In JEXL expressions, all host metadata fields are available. The other host fields are available under these names:

      unfetched – number of unfetched records
      fetched – number of fetched records
      gone – number of 404s
      redirTemp – number of temporary redirects
      redirPerm – number of permanent redirects
      redirs – total number of redirects (redirTemp + redirPerm)
      notModified – number of not modified records
      ok – number of usable pages (fetched + notModified)
      numRecords – total number of records
      dnsFailures – number of DNS failures
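      To make the relationships between these fields concrete, here is a small worked example in shell arithmetic. It is only an illustration: the counts are made-up values for one hypothetical host, and the integer comparison stands in for the JEXL expression "(gone / numRecords > 0.5)" because POSIX shell arithmetic has no floating point.

```shell
#!/bin/sh
# Sketch only: the counts below are made-up values for one hypothetical host.
fetched=40 notModified=10 gone=70 redirTemp=3 redirPerm=7 numRecords=130

# Derived fields, as defined in the field list.
ok=$((fetched + notModified))        # usable pages
redirs=$((redirTemp + redirPerm))    # total number of redirects

# Integer form of "gone / numRecords > 0.5": 2 * gone > numRecords.
# The two are equivalent whenever numRecords is positive.
if [ $((2 * gone)) -gt "$numRecords" ]; then
  mostly_gone=yes
else
  mostly_gone=no
fi

echo "ok=$ok redirs=$redirs mostly_gone=$mostly_gone"
```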

      Also, see nutch-default.xml for the hostdb.* properties.


        1. NUTCH-1325.patch
          64 kB
          Markus Jelsma
        2. NUTCH-1325.patch
          63 kB
          Markus Jelsma
        3. NUTCH-1325.patch
          63 kB
          Markus Jelsma
        4. NUTCH-1325.patch
          63 kB
          Markus Jelsma
        5. NUTCH-1325.trunk.v2.path
          44 kB
          Tejas Patil
        6. NUTCH-1325-1.6-1.patch
          43 kB
          Markus Jelsma
        7. NUTCH-1325-removed-from-1.8.patch
          44 kB
          Markus Jelsma
        8. NUTCH-1325-trunk-v3.patch
          44 kB
          Tejas Patil
        9. NUTCH-1325-trunk-v4.patch
          45 kB
          Tejas Patil
        10. NUTCH-1325-trunk-v5.patch
          45 kB
          Gui Forget
        11. NUTCH-1325-v4-v5.patch
          6 kB
          Gui Forget
        12. oi-hostdb.patch
          56 kB
          Markus Jelsma
        13. oi-hostdb.patch
          57 kB
          Markus Jelsma
        14. oi-hostdb.patch
          53 kB
          Markus Jelsma

              • Assignee:
                markus17 Markus Jelsma
              • Votes:
                2
              • Watchers:
                9


                • Created: