Nutch / NUTCH-784

CrawlDBScanner

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      The attached patch adds a utility that dumps all crawldb entries whose URL matches a regular expression. The dump mechanism of the crawldb reader is not very useful on large crawldbs, because the output can be extremely large, and the -url option does not help when we don't know in advance which URL we want to look at.

      The CrawlDBScanner can generate either a text representation of the CrawlDatum objects or binary objects that can then be used as a new CrawlDB.

      Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]

      regex: regular expression on the crawldb key
      -s status : constraint on the status of the crawldb entries e.g. db_fetched, db_unfetched
      -text : if this parameter is used, the output will be of TextOutputFormat; otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
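      The selection rule described by these options can be sketched in plain Java. This is only an illustration, not the code from NUTCH-784.patch: the class CrawlDbScanSketch, its methods accept and scan, and the in-memory map of URL to status name are all hypothetical, and the sketch assumes that the regex must match the whole URL and that the status is compared as an exact string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for the scanner's selection logic; not the patch itself.
class CrawlDbScanSketch {

    // Returns true when the regex matches the whole URL (assumption) and,
    // if a status constraint is given (non-null), the status name equals it.
    static boolean accept(String url, String status, Pattern regex, String wantedStatus) {
        if (!regex.matcher(url).matches()) {
            return false;
        }
        return wantedStatus == null || wantedStatus.equals(status);
    }

    // Applies accept() to an in-memory map of URL -> status name,
    // which stands in for the real crawldb here.
    static List<String> scan(Map<String, String> crawlDb, String regex, String wantedStatus) {
        Pattern compiled = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> entry : crawlDb.entrySet()) {
            if (accept(entry.getKey(), entry.getValue(), compiled, wantedStatus)) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> db = new LinkedHashMap<>();
        db.put("http://www.amazon.com/books", "db_fetched");
        db.put("http://www.amazon.com/music", "db_unfetched");
        db.put("http://nutch.apache.org/", "db_fetched");

        // With a status constraint (like -s db_fetched).
        System.out.println(scan(db, ".+amazon.com.*", "db_fetched"));
        // Without a status constraint.
        System.out.println(scan(db, ".+amazon.com.*", null));
    }
}
```

      In the sketch, passing null as the status plays the role of omitting -s: only the regex decides which entries are kept.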

      For instance, the command below:
      ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s db_fetched -text

      will generate a text file /tmp/amazon-dump containing all crawldb entries that match the regexp .+amazon.com.* and have a status of db_fetched.
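      A small Java check shows how the regexp .+amazon.com.* behaves. It assumes the scanner uses java.util.regex and requires the expression to match the whole URL, which the issue does not state; the class RegexDotSketch, the helper fullMatch and the sample URLs are hypothetical. Because the dot is unescaped, it matches any character, so a stricter variant escapes it.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying out the regexp from the example command.
class RegexDotSketch {

    // True when the regex matches the entire URL (assumed matching mode).
    static boolean fullMatch(String regex, String url) {
        return Pattern.compile(regex).matcher(url).matches();
    }

    public static void main(String[] args) {
        String loose = ".+amazon.com.*";    // regexp as written in the example
        String strict = ".+amazon\\.com.*"; // same regexp with the dot escaped

        String[] urls = {
            "http://www.amazon.com/books",
            "amazon.com/books",                // nothing before "amazon", so ".+" fails
            "http://www.amazonXcom.example/"   // only the unescaped dot accepts "X"
        };
        for (String url : urls) {
            System.out.println(url
                + " loose=" + fullMatch(loose, url)
                + " strict=" + fullMatch(strict, url));
        }
    }
}
```

      When running the command from a shell, quoting the regexp is a reasonable precaution, since an unquoted * may be expanded by the shell if a matching file name exists.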

      Attachments

        1. NUTCH-784.patch
          6 kB
          Julien Nioche

            People

              Assignee: Julien Nioche (jnioche)
              Reporter: Julien Nioche (jnioche)