Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12
    • Component/s: hostdb
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      HostDB for Apache Nutch 1.x

      • automatically generates a HostDB based on CrawlDB information
      • periodically performs DNS lookups for all hosts and keeps track of DNS failures
      • discovers the homepage if www.example.org/ is a redirect
      • keeps track of host statistics such as the number of URLs, 404s, not-modified responses and redirects
      • aggregates CrawlDB metadata fields into totals, sums, min, max, average and configurable percentiles
      • can output lists of discovered homepage URLs for seed lists and static fetch interval
      • can output blacklists of hosts that have too many DNS failures, to filter them from the CrawlDB using domainblacklist-urlfilter
      • supports JEXL expressions, just like the CrawlDB
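
To make the aggregation item above concrete, here is a minimal Python sketch of how a numeric metadata field (for example, response times) could be reduced to a sum, min, max, average and a set of percentiles per host. It is an illustration only, not the code from the patch: the function names, the output keys such as `pct95`, and the nearest-rank percentile method are all assumptions.

```python
import math
from statistics import mean


def nearest_rank_percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list (0 < pct <= 100).

    Assumption: the issue does not say which percentile method HostDB uses.
    """
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def aggregate(values, percentiles=(50, 95)):
    """Reduce one host's values for a metadata field to summary statistics."""
    ordered = sorted(values)
    summary = {
        "sum": sum(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "avg": mean(ordered),
    }
    # Hypothetical key naming; the real HostDB field names may differ.
    for pct in percentiles:
        summary[f"pct{pct}"] = nearest_rank_percentile(ordered, pct)
    return summary


if __name__ == "__main__":
    response_times_ms = [20, 35, 50, 80, 400]
    print(aggregate(response_times_ms))
```

A single slow response (400 ms) pulls the average well above the median here, which is one reason configurable percentiles are useful alongside the average.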

      Examples

      Generate for the first time, or update an existing HostDB:

      bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb
      

      Optional filtering or normalizing:

      bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb -filter -normalize
      

      Dumping as CSV file:

      bin/nutch readhostdb crawl/hostdb output_directory
      

      Get only hostnames that have an average response time above 50 ms:

      bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(avg._rs_ > 50)"
      

      Get only hosts where more than 50% of the records are 404s:

      bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(gone / numRecords > 0.5)"
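
The selection performed by the expression above can be mimicked outside Nutch with a small Python sketch. The host names and counts are made-up sample data, and `mostly_gone` is a hypothetical helper. The sketch assumes the ratio is computed with floating-point division, as Python's `/` does; the issue does not say how the JEXL engine treats two integer operands, so that may be worth checking on a small HostDB first.

```python
def mostly_gone(host, threshold=0.5):
    """Return True when more than `threshold` of the host's records are 404s."""
    if host["numRecords"] == 0:
        # Avoid dividing by zero for hosts without any records.
        return False
    return host["gone"] / host["numRecords"] > threshold


# Made-up sample data keyed by hostname.
hosts = {
    "a.example.org": {"gone": 8, "numRecords": 10},
    "b.example.org": {"gone": 1, "numRecords": 10},
    "c.example.org": {"gone": 5, "numRecords": 10},
}

selected = sorted(name for name, stats in hosts.items() if mostly_gone(stats))

if __name__ == "__main__":
    print(selected)
```

Note that a host with exactly half of its records gone is not selected, because the comparison is strict.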
      

      In JEXL expressions, all host metadata fields are available. The built-in host fields are available as:

      unfetched – number of unfetched records
      fetched – number of fetched records
      gone – number of 404s
      redirTemp – number of temporary redirects
      redirPerm – number of permanent redirects
      redirs – total number of redirects (redirTemp + redirPerm)
      notModified – number of not modified records
      ok – number of usable pages (fetched + notModified)
      numRecords – total number of records
      dnsFailures – number of DNS failures
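
The two derived fields in the list above follow directly from the raw counters. The Python sketch below shows those relationships. `HostCounters` is a hypothetical stand-in, not the class from the patch, and `numRecords` is left out because the issue does not state how it is computed.

```python
from dataclasses import dataclass


@dataclass
class HostCounters:
    """Hypothetical per-host counters mirroring the field list above."""
    unfetched: int = 0
    fetched: int = 0
    gone: int = 0
    redir_temp: int = 0
    redir_perm: int = 0
    not_modified: int = 0
    dns_failures: int = 0

    @property
    def redirs(self) -> int:
        # redirs = redirTemp + redirPerm
        return self.redir_temp + self.redir_perm

    @property
    def ok(self) -> int:
        # ok = fetched + notModified
        return self.fetched + self.not_modified


if __name__ == "__main__":
    counters = HostCounters(fetched=7, not_modified=3, redir_temp=2, redir_perm=1)
    print(counters.redirs, counters.ok)
```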

      See also nutch-default.xml for the hostdb.* properties.

      Attachments

      1. NUTCH-1325.patch
        64 kB
        Markus Jelsma
      2. NUTCH-1325.patch
        63 kB
        Markus Jelsma
      3. NUTCH-1325.patch
        63 kB
        Markus Jelsma
      4. NUTCH-1325.patch
        63 kB
        Markus Jelsma
      5. NUTCH-1325.trunk.v2.path
        44 kB
        Tejas Patil
      6. NUTCH-1325-1.6-1.patch
        43 kB
        Markus Jelsma
      7. NUTCH-1325-removed-from-1.8.patch
        44 kB
        Markus Jelsma
      8. NUTCH-1325-trunk-v3.patch
        44 kB
        Tejas Patil
      9. NUTCH-1325-trunk-v4.patch
        45 kB
        Tejas Patil
      10. NUTCH-1325-trunk-v5.patch
        45 kB
        Gui Forget
      11. NUTCH-1325-v4-v5.patch
        6 kB
        Gui Forget
      12. oi-hostdb.patch
        56 kB
        Markus Jelsma
      13. oi-hostdb.patch
        57 kB
        Markus Jelsma
      14. oi-hostdb.patch
        53 kB
        Markus Jelsma

          Activity

          Markus Jelsma made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Markus Jelsma made changes -
          Component/s hostdb [ 12328749 ]
          Markus Jelsma made changes -
          Fix Version/s 1.12 [ 12333328 ]
          Markus Jelsma made changes -
          Attachment NUTCH-1325.patch [ 12783585 ]
          Markus Jelsma made changes -
          Description A HostDB for Nutch and associated tools to create and read a database containing information on hosts.
          Patch Info Patch Available [ 10042 ]
          Markus Jelsma made changes -
          Attachment NUTCH-1325.patch [ 12783575 ]
          Markus Jelsma made changes -
          Attachment NUTCH-1325.patch [ 12783571 ]
          Markus Jelsma made changes -
          Attachment NUTCH-1325.patch [ 12783141 ]
          Markus Jelsma made changes -
          Assignee Markus Jelsma [ markus17 ]
          Markus Jelsma made changes -
          Link This issue supersedes NUTCH-1149 [ NUTCH-1149 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.11 [ 12329358 ]
          Markus Jelsma made changes -
          Attachment oi-hostdb.patch [ 12707752 ]
          Markus Jelsma made changes -
          Attachment oi-hostdb.patch [ 12707743 ]
          Markus Jelsma made changes -
          Assignee Tejas Patil [ tejasp ]
          Markus Jelsma made changes -
          Attachment oi-hostdb.patch [ 12707509 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.11 [ 12329358 ]
          Fix Version/s 1.10 [ 12327187 ]
          Gui Forget made changes -
          Attachment NUTCH-1325-trunk-v5.patch [ 12673518 ]
          Attachment NUTCH-1325-v4-v5.patch [ 12673519 ]
          Julien Nioche made changes -
          Fix Version/s 1.10 [ 12327187 ]
          Fix Version/s 1.9 [ 12324611 ]
          Markus Jelsma made changes -
          Fix Version/s 1.9 [ 12324611 ]
          Fix Version/s 1.8 [ 12324326 ]
          Markus Jelsma made changes -
          Attachment NUTCH-1325-removed-from-1.8.patch [ 12633351 ]
          Markus Jelsma made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Tejas Patil made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 1.8 [ 12324326 ]
          Fix Version/s 1.9 [ 12324611 ]
          Resolution Fixed [ 1 ]
          Tejas Patil made changes -
          Assignee Markus Jelsma [ markus17 ] Tejas Patil [ tejasp ]
          Tejas Patil made changes -
          Attachment NUTCH-1325-trunk-v4.patch [ 12624178 ]
          Tejas Patil made changes -
          Attachment NUTCH-1325-trunk-v4.patch [ 12624171 ]
          Tejas Patil made changes -
          Attachment NUTCH-1325-trunk-v4.patch [ 12624171 ]
          Tejas Patil made changes -
          Attachment NUTCH-1325-trunk-v3.patch [ 12621024 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.9 [ 12324611 ]
          Fix Version/s 1.7 [ 12323281 ]
          Tejas Patil made changes -
          Attachment NUTCH-1325.trunk.v2.path [ 12582836 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.7 [ 12323281 ]
          Fix Version/s 1.6 [ 12319941 ]
          Markus Jelsma made changes -
          Field Original Value New Value
          Attachment NUTCH-1325-1.6-1.patch [ 12526332 ]
          Markus Jelsma created issue -

            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma
            • Votes:
              2
              Watchers:
              9
