Nutch / NUTCH-1526

Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs

Details

    Description

      It only took me 1.2 years, but I finally got around to it. This patch will deliver a SegmentContentDumper tool per the description here:

      http://s.apache.org/kv

      And per the interface here:

      ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
         -segmentRootDir   Full file path to the root segment directory, e.g., crawl/segments.
         -regexUrlPattern  A regex URL pattern to select URL keys to dump from the content DB in each segment.
         -outputDir        The output directory to write file names to.
         -metadata         --key=value, where key is a Content Metadata key and value is a value to check.

      If a URL's content metadata has a matching key/value pair, dump that URL's content. The value may be matched as a regex.
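
      The selection rule can be sketched in plain Java as follows. This is an illustration only, not code from the attached patch: the class name DumpFilterSketch, the method shouldDump, and the use of a simple Map in place of Nutch's Content metadata are all hypothetical. It also assumes that the URL pattern and the metadata check must both pass, and that both regexes must match the whole string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of the dump-selection rule; not taken from the patch. */
public class DumpFilterSketch {
    private final Pattern urlPattern;
    private final String metaKey;
    private final Pattern metaValuePattern;

    /**
     * @param regexUrlPattern value of -regexUrlPattern
     * @param metadataArg     value of -metadata, in the form --key=value
     */
    public DumpFilterSketch(String regexUrlPattern, String metadataArg) {
        this.urlPattern = Pattern.compile(regexUrlPattern);
        // Drop the leading "--" and split on the first '=' only,
        // so the value itself may contain '=' characters.
        String kv = metadataArg.startsWith("--") ? metadataArg.substring(2) : metadataArg;
        int eq = kv.indexOf('=');
        if (eq <= 0) {
            throw new IllegalArgumentException("expected --key=value, got: " + metadataArg);
        }
        this.metaKey = kv.substring(0, eq);
        // The value is treated as a regex (assumption: full-string match).
        this.metaValuePattern = Pattern.compile(kv.substring(eq + 1));
    }

    /** Returns true when the URL matches and the metadata holds a matching key/value pair. */
    public boolean shouldDump(String url, Map<String, String> metadata) {
        if (!urlPattern.matcher(url).matches()) {
            return false;
        }
        String value = metadata.get(metaKey);
        return value != null && metaValuePattern.matcher(value).matches();
    }

    public static void main(String[] args) {
        DumpFilterSketch filter =
            new DumpFilterSketch(".*\\.pdf", "--Content-Type=application/pdf.*");

        // Single-valued map as a stand-in for real content metadata.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "application/pdf; charset=binary");

        System.out.println(filter.shouldDump("http://example.com/a.pdf", metadata));
        System.out.println(filter.shouldDump("http://example.com/a.html", metadata));
    }
}
```

      Real content metadata can hold several values per key; a fuller version would likely test each value in turn rather than a single string.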

      Attachments

        1. NUTCH-1526.Mattmann.090514.patch.txt
          5 kB
          Chris A. Mattmann

          People

            chrismattmann Chris A. Mattmann
            Votes: 0
            Watchers: 4
