Description
A parse filter that takes a regex and a field name. If the regex matches the page's HTML or extracted text (tested with matcher.find()), the named field is set to true in the CrawlDatum's metadata.
Combined with the HostDB, it is easy to get a list of hosts that match some regex criteria.
# Example configuration file for parsefilter-regex
#
# Parse metadata field <name> is set to true if the HTML matches the regex. The
# source can either be html or text. If source is html, the regex is applied to
# the entire HTML tree. If source is text, the regex is applied to the
# extracted text.
#
# format: <name>\t<source>\t<regex>\n
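To make the rule format concrete, here is a minimal, self-contained sketch of how one tab-separated rule line could be parsed and evaluated with matcher.find(). It is an illustration only, not the plugin's actual code: the class RegexRuleSketch, its Rule type, the helper methods, and the sample rules hasForm and mentionsContact are all hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the plugin's real implementation.
public class RegexRuleSketch {

    /** One rule line: field name, source ("html" or "text"), compiled regex. */
    public static final class Rule {
        public final String name;
        public final String source;
        public final Pattern pattern;

        Rule(String name, String source, Pattern pattern) {
            this.name = name;
            this.source = source;
            this.pattern = pattern;
        }
    }

    /** Parses a line of the form <name>\t<source>\t<regex>. */
    public static Rule parseRule(String line) {
        // Limit of 3 keeps any tab characters inside the regex itself intact.
        String[] parts = line.split("\t", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected <name>\\t<source>\\t<regex>: " + line);
        }
        String source = parts[1].trim();
        if (!source.equals("html") && !source.equals("text")) {
            throw new IllegalArgumentException("Source must be html or text: " + source);
        }
        return new Rule(parts[0].trim(), source, Pattern.compile(parts[2]));
    }

    /** Applies the rule to the raw HTML or the extracted text, depending on its source. */
    public static boolean matches(Rule rule, String html, String text) {
        String input = rule.source.equals("html") ? html : text;
        // find() succeeds if the regex matches anywhere in the input,
        // unlike matches(), which requires the whole input to match.
        return rule.pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        String html = "<html><body><form action=\"/search\"></form>Contact us</body></html>";
        String text = "Contact us";

        // Sample rule lines (assumed for illustration).
        String[] lines = {
            "hasForm\thtml\t<form[^>]*>",
            "mentionsContact\ttext\t(?i)contact"
        };
        for (String line : lines) {
            Rule rule = parseRule(line);
            System.out.println(rule.name + " = " + matches(rule, html, text));
        }
    }
}
```

Because find() looks for a match anywhere in the input, a rule's regex does not need leading or trailing .* to match part of a page.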