A parse filter that takes a regex and a field name. If the regex matches the HTML (tested with matcher.find()), the field name is set to true in the CrawlDatum's metadata.
Combined with the HostDB, it is easy to get a list of hosts that match some regex criteria.
# Example configuration file for parsefilter-regex
# Parse metadata field <name> is set to true if the source matches the regex. The
# source can either be html or text. If source is html, the regex is applied to
# the entire HTML tree. If source is text, the regex is applied to the
# extracted text.
# format: <name>\t<source>\t<regex>\n
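
The rule format and the find() semantics described above can be illustrated with a minimal, standalone Java sketch. It is not the plugin's actual code: the class `RegexRuleSketch`, its helpers `parseRule` and `matches`, and the rule name `hasform` are hypothetical names chosen for illustration. It assumes a rule line has exactly three tab-separated fields and that any source other than `html` means the extracted text.

```java
import java.util.regex.Pattern;

public class RegexRuleSketch {

    /** One parsed rule line: field name, source ("html" or "text"), compiled regex. */
    static final class Rule {
        final String field;
        final String source;
        final Pattern pattern;

        Rule(String field, String source, Pattern pattern) {
            this.field = field;
            this.source = source;
            this.pattern = pattern;
        }
    }

    /** Splits a line of the form <name>\t<source>\t<regex> into a Rule. */
    static Rule parseRule(String line) {
        // Limit of 3 keeps any tab characters inside the regex itself intact.
        String[] parts = line.split("\t", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected <name>\\t<source>\\t<regex>: " + line);
        }
        return new Rule(parts[0], parts[1], Pattern.compile(parts[2]));
    }

    /** Applies the rule to the raw HTML or the extracted text, depending on its source. */
    static boolean matches(Rule rule, String html, String text) {
        String input = "html".equals(rule.source) ? html : text;
        // find() succeeds if the regex matches anywhere in the input,
        // unlike matches(), which requires the whole input to match.
        return rule.pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        // Hypothetical rule: flag pages whose HTML contains a <form> tag.
        Rule rule = parseRule("hasform\thtml\t<form[^>]*>");
        String html = "<html><body><form action=\"/q\"></form></body></html>";
        String text = "";
        System.out.println(rule.field + "=" + matches(rule, html, text));
    }
}
```

Because find() is used, a regex does not need leading or trailing `.*` to match a fragment of the page; anchors such as `^` and `$` are only needed when a rule should constrain where the match occurs.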