Nutch / NUTCH-1526

Create SegmentContentDumperTool for easily extracting file contents from SegmentDirs

      Description

      It only took me 1.2 years, but I finally got around to it. This patch will deliver a SegmentContentDumper tool per the description here:

      http://s.apache.org/kv

      And per the interface here:

      ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
         -segmentRootDir   full file path to the root segment directory, e.g., crawl/segments
         -regexUrlPattern  a regex URL pattern to select URL keys to dump from the content DB in each segment
         -outputDir        the output directory to write the dumped files to
         -metadata         --key=value, where key is a Content Metadata key and value is a value to check
      

      If a URL matches the pattern and its content metadata contains a matching key/value pair, dump its content. The value should also support regex matching.
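The selection rule described above could be sketched roughly as follows. This is only an illustration, not the code in the attached patch: the class name `DumpFilterSketch`, the method `shouldDump`, and the use of a plain `Map<String, String>` in place of Nutch's content metadata are all assumptions, and it uses partial matching (`find()`) for both the URL and the metadata value, which the issue does not specify.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of the dump filter; not taken from the NUTCH-1526 patch.
public class DumpFilterSketch {

    // Returns true when the URL matches urlPattern and the metadata holds
    // an entry for key whose value matches valuePattern.
    // Assumption: partial matching via find(); use matches() for whole-string matching.
    static boolean shouldDump(String url, Map<String, String> metadata,
                              Pattern urlPattern, String key, Pattern valuePattern) {
        if (!urlPattern.matcher(url).find()) {
            return false; // URL key not selected by -regexUrlPattern
        }
        String value = metadata.get(key);
        // A missing key means there is no key/value pair to match.
        return value != null && valuePattern.matcher(value).find();
    }

    public static void main(String[] args) {
        // Stand-in for a record's content metadata (illustrative values).
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "application/pdf");

        Pattern urlPattern = Pattern.compile("\\.pdf$");
        Pattern valuePattern = Pattern.compile("^application/");

        System.out.println(shouldDump("http://example.com/a.pdf",
                metadata, urlPattern, "Content-Type", valuePattern));
        System.out.println(shouldDump("http://example.com/index.html",
                metadata, urlPattern, "Content-Type", valuePattern));
    }
}
```

Keeping the check as a pure function over a URL and a metadata map makes it easy to test separately from segment reading and file output.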

        Attachments

        1. NUTCH-1526.Mattmann.090514.patch.txt (5 kB, Chris A. Mattmann)

            People

            • Assignee: Chris A. Mattmann (chrismattmann)
            • Reporter: Chris A. Mattmann (chrismattmann)