  1. Nutch
  2. NUTCH-2408

CrawlDb: allow update from unparsed segments

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: crawldb
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The updatedb command (class o.a.n.crawl.CrawlDb) cannot update the CrawlDb from the fetch status alone (segment subdirectory crawl_fetch): it always reads crawl_parse as well, which contains outlinks as well as scores, signatures, and metadata.

      As a result, a workflow that does not need to parse documents (e.g., because the raw HTML content is exported to WARC files) cannot update the CrawlDb to store the fetch status.
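
      The requested behavior can be illustrated with a minimal sketch. It is not the Nutch implementation or the attached patch: the class SegmentInputs and the method updateInputs are hypothetical names, and the sketch uses java.nio.file on a local directory rather than the Hadoop file system API. It only shows the idea of always reading crawl_fetch and reading crawl_parse only when the segment has actually been parsed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SegmentInputs {

    // Segment subdirectory names as given in the description.
    static final String FETCH_DIR = "crawl_fetch";
    static final String PARSE_DIR = "crawl_parse";

    /**
     * Hypothetical helper (not part of Nutch): returns the segment
     * subdirectories an update would read. crawl_fetch is required;
     * crawl_parse is added only if it exists, i.e. the segment was parsed.
     */
    static List<Path> updateInputs(Path segment) throws IOException {
        List<Path> inputs = new ArrayList<>();

        Path fetch = segment.resolve(FETCH_DIR);
        if (!Files.isDirectory(fetch)) {
            throw new IOException("Segment has no " + FETCH_DIR + ": " + segment);
        }
        inputs.add(fetch);

        // An unparsed segment simply contributes no parse data.
        Path parse = segment.resolve(PARSE_DIR);
        if (Files.isDirectory(parse)) {
            inputs.add(parse);
        }
        return inputs;
    }
}
```

      With this approach an unparsed segment still contributes its fetch status, while outlinks, scores, signatures, and metadata are picked up only from segments that contain crawl_parse.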


            People

            • Assignee:
              Sebastian Nagel (wastl-nagel)
            • Reporter:
              Sebastian Nagel (wastl-nagel)