Description
It only took me 1.2 years, but I finally got around to it. This patch will deliver a SegmentContentDumper tool per the description here:
And per the interface here:
./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
  -segmentRootDir   full file path to the root segment directory, e.g., crawl/segments
  -regexUrlPattern  a regex URL pattern to select URL keys to dump from the content DB in each segment
  -outputDir        the output directory to write file names to
  -metadata --key=value
                    where key is a Content Metadata key and value is a value to check.
                    If the URL and its content metadata have a matching key,value pair, dump it.
                    Allow for regex matching on the value.
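
To make the selection rule concrete, here is a rough sketch of how the per-record check could look. This is not the code in the patch: the class name DumpFilterSketch and its method are made up for illustration, it uses only java.util and no Nutch classes, and it assumes that (a) the URL regex and the metadata value regex must each match the whole string, (b) both checks must pass for a record to be dumped, and (c) the -metadata filter is optional.

```java
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of the patch): decides whether a single
 * URL/content record should be dumped, given the command-line filters.
 */
final class DumpFilterSketch {
    private final Pattern urlPattern;    // compiled from -regexUrlPattern
    private final String metadataKey;    // key from -metadata --key=value, or null if unused
    private final Pattern valuePattern;  // value from -metadata, treated as a regex

    DumpFilterSketch(String regexUrlPattern, String metadataKey, String valueRegex) {
        this.urlPattern = Pattern.compile(regexUrlPattern);
        this.metadataKey = metadataKey;
        this.valuePattern = (valueRegex == null) ? null : Pattern.compile(valueRegex);
    }

    /** Returns true when the URL matches and, if a metadata filter is set, the value matches too. */
    boolean shouldDump(String url, Map<String, String> contentMetadata) {
        // Assumption: the URL must match the pattern in full, not just contain a match.
        if (!urlPattern.matcher(url).matches()) {
            return false;
        }
        // Assumption: with no -metadata option, the URL match alone is enough.
        if (metadataKey == null || valuePattern == null) {
            return true;
        }
        String value = contentMetadata.get(metadataKey);
        return value != null && valuePattern.matcher(value).matches();
    }
}
```

For example, a filter built with new DumpFilterSketch(".*\\.pdf", "Content-Type", "application/.*") would accept a URL ending in .pdf whose Content-Type metadata is application/pdf, and reject the same URL if that metadata key is missing or holds text/html.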