Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12
    • Component/s: hostdb
    • Labels: None
    • Patch Info: Patch Available

      Description

      HostDB for Apache Nutch 1.x

      • automatically generates a HostDB based on CrawlDB information
      • periodically performs DNS lookup for all hosts and keeps track of DNS failures
      • discovers homepage if www.example.org/ is a redirect
      • keeps track of host statistics such as the number of URLs, 404s, not-modified responses and redirects
      • aggregates CrawlDB metadata fields into totals, sums, min, max, average and configurable percentiles
      • can output lists of discovered homepage URLs for seed lists and static fetch interval
      • can output blacklists of hosts that have too many DNS failures, so they can be filtered from the CrawlDB using domainblacklist-urlfilter
      • supports JEXL expressions, just like the CrawlDB
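
The metadata aggregation listed above can be illustrated with a minimal Python sketch. This is not Nutch's implementation: the helper name `aggregate_field`, the nearest-rank percentile method, the `pctNN` key names and the sample values are all assumptions made for illustration.

```python
import math

def aggregate_field(values, percentiles=(50, 90)):
    """Summarise one numeric metadata field collected for a single host.

    Hypothetical helper: the issue does not say how Nutch computes
    percentiles, so the nearest-rank method is assumed here.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    n = len(ordered)
    stats = {
        "sum": sum(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "avg": sum(ordered) / n,
    }
    for p in percentiles:
        # Nearest-rank: smallest value with at least p% of the data at or below it.
        rank = max(1, math.ceil(p / 100 * n))
        stats[f"pct{p}"] = ordered[rank - 1]
    return stats

# Made-up response times (ms) for the pages of one host.
response_times = [20, 35, 40, 80, 125]
summary = aggregate_field(response_times)
print(summary)
```

The percentile list is a parameter here because the feature description calls the percentiles configurable; the actual configuration keys are among the hostdb.* properties and are not reproduced in this sketch.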

      Examples

      Generate for the first time, or update an existing HostDB:

      bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb
      

      Optional filtering or normalizing:

      bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb -filter -normalize
      

      Dumping as CSV file:

      bin/nutch readhostdb crawl/hostdb output_directory
      

      Get only hostnames that have an average response time above 50 ms:

      bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(avg._rs_ > 50)"
      

      Get only hosts for which over 50% of the records are 404s:

      bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(gone / numRecords > 0.5)"
      

      In JEXL expressions, all host metadata fields are available. The following fields are also available:

      unfetched – number of unfetched records
      fetched – number of fetched records
      gone – number of 404s
      redirTemp – number of temporary redirects
      redirPerm – number of permanent redirects
      redirs – total number of redirects (redirTemp + redirPerm)
      notModified – number of not modified records
      ok – number of usable pages (fetched + notModified)
      numRecords – total number of records
      dnsFailures – number of DNS failures
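
The relationships between these fields, and the "over 50% 404s" filter from the examples, can be sketched in Python. This only mirrors the arithmetic; it does not evaluate JEXL. The helper names `host_fields` and `mostly_gone`, the host names and the counts are hypothetical, and treating numRecords as the sum of the per-status counts is an assumption, since the issue only describes it as the total number of records.

```python
def host_fields(unfetched, fetched, gone, redir_temp, redir_perm,
                not_modified, dns_failures=0):
    """Build the per-host fields listed above from raw status counts."""
    redirs = redir_temp + redir_perm          # redirs = redirTemp + redirPerm
    ok = fetched + not_modified               # ok = fetched + notModified
    # Assumption: every record falls into exactly one of these statuses.
    num_records = unfetched + fetched + gone + redirs + not_modified
    return {
        "unfetched": unfetched,
        "fetched": fetched,
        "gone": gone,
        "redirTemp": redir_temp,
        "redirPerm": redir_perm,
        "redirs": redirs,
        "notModified": not_modified,
        "ok": ok,
        "numRecords": num_records,
        "dnsFailures": dns_failures,
    }

def mostly_gone(fields, threshold=0.5):
    """Python counterpart of the expression (gone / numRecords > 0.5)."""
    if fields["numRecords"] == 0:
        return False  # avoid dividing by zero for empty hosts
    return fields["gone"] / fields["numRecords"] > threshold

# Made-up hosts and counts, for illustration only.
hosts = {
    "a.example.org": host_fields(unfetched=1, fetched=2, gone=7,
                                 redir_temp=0, redir_perm=0, not_modified=0),
    "b.example.org": host_fields(unfetched=0, fetched=8, gone=1,
                                 redir_temp=1, redir_perm=0, not_modified=0),
}
selected = sorted(name for name, f in hosts.items() if mostly_gone(f))
print(selected)
```

Only a.example.org is selected, because 7 of its 10 records are gone, while b.example.org has 1 of 10.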

      See also nutch-default.xml for the hostdb.* properties.

      1. oi-hostdb.patch (53 kB, Markus Jelsma)
      2. oi-hostdb.patch (57 kB, Markus Jelsma)
      3. oi-hostdb.patch (56 kB, Markus Jelsma)
      4. NUTCH-1325-v4-v5.patch (6 kB, Gui Forget)
      5. NUTCH-1325-trunk-v5.patch (45 kB, Gui Forget)
      6. NUTCH-1325-trunk-v4.patch (45 kB, Tejas Patil)
      7. NUTCH-1325-trunk-v3.patch (44 kB, Tejas Patil)
      8. NUTCH-1325-removed-from-1.8.patch (44 kB, Markus Jelsma)
      9. NUTCH-1325-1.6-1.patch (43 kB, Markus Jelsma)
      10. NUTCH-1325.trunk.v2.path (44 kB, Tejas Patil)
      11. NUTCH-1325.patch (63 kB, Markus Jelsma)
      12. NUTCH-1325.patch (63 kB, Markus Jelsma)
      13. NUTCH-1325.patch (63 kB, Markus Jelsma)
      14. NUTCH-1325.patch (64 kB, Markus Jelsma)

            People

            • Assignee: Markus Jelsma
            • Reporter: Markus Jelsma
            • Votes: 2
            • Watchers: 9
