Nutch / NUTCH-2408

CrawlDb: allow update from unparsed segments

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: crawldb
    • Labels: None
    • Patch Info: Patch Available

    Description

      The updatedb command (class o.a.n.crawl.CrawlDb) cannot update the CrawlDb with fetch status alone (from the segment subdirectory crawl_fetch): it also requires crawl_parse, which contains outlinks as well as scores, signatures, and metadata.

      A workflow that does not parse documents (e.g., because raw HTML content is exported to WARC files) therefore cannot update the CrawlDb to record the fetch status.
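
The requested behavior can be illustrated with a minimal Java sketch. It is not the patch attached to this issue and does not use any Nutch API: the class SegmentInputSketch, the method inputsFor, and the segment name are hypothetical and exist only for illustration. Only the subdirectory names crawl_fetch and crawl_parse come from the description above. The sketch assumes that an update should always read crawl_fetch and should read crawl_parse only when the segment has been parsed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: not Nutch code and not the actual fix.
public class SegmentInputSketch {
    static final String FETCH_DIR = "crawl_fetch";
    static final String PARSE_DIR = "crawl_parse";

    /**
     * Returns the subdirectories of one segment that an update would read.
     * existingSubdirs lists the subdirectories present in the segment.
     */
    static List<String> inputsFor(String segment, Set<String> existingSubdirs) {
        List<String> inputs = new ArrayList<>();
        if (!existingSubdirs.contains(FETCH_DIR)) {
            // No fetch status available: nothing to update from this segment.
            return inputs;
        }
        inputs.add(segment + "/" + FETCH_DIR);
        if (existingSubdirs.contains(PARSE_DIR)) {
            // Parsed segment: outlinks, scores, signatures and metadata are read too.
            inputs.add(segment + "/" + PARSE_DIR);
        }
        return inputs;
    }

    public static void main(String[] args) {
        // An unparsed segment (hypothetical name) contributes crawl_fetch only.
        Set<String> unparsed = Set.of("content", "crawl_generate", "crawl_fetch");
        System.out.println(inputsFor("segments/20170801", unparsed));
    }
}
```

With this selection rule, a segment that was fetched but never parsed still contributes its fetch status, while a parsed segment is handled as before.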

          People

            Assignee: Sebastian Nagel (snagel)
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 4
