I agree about wanting the decision available not just in the Fetcher while parsing but also when parsing a segment. Here is the problem as I see it with returning null content.
Say we want to build a topical search engine about sports. We fetch pages and run each one through a fetch filter that makes a yes/no decision, based on its content, on whether the page is about sports. If we null out the content, and with it the ParseText and ParseData, we still have the CrawlDatum to deal with. If we leave it as is, the CrawlDatum gets updated into CrawlDb as successfully fetched. Content and parse won't be collected because they are null, and we avoid the problem of that page's outlinks being queued in CrawlDb, but the original URL is still there and will be queued again after its fetch interval. Over time we end up with a large number of URLs that we already know are filtered being repeatedly recrawled.
The decision point isn't just whether to keep the content. It is whether to keep the URL and its content/parse and continue crawling down the path of the URL's outlinks, or to ignore this URL and not crawl anything it points to, breaking the crawl graph at that point. Hence FetchFilter. My solution was to null out content/parse and emit a different CrawlDatum that essentially said the page was gone. Ideally we would have a separate status, but gone worked as a first pass. That gets updated back into CrawlDb and the URL won't be recrawled at a later date. This was only possible in the Fetcher, though. A rough sketch of what I mean is below.
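To make the idea concrete, here is a minimal sketch, assuming a hypothetical FetchFilter extension point. The interface name, the accept() signature, and the applyFilter helper are all illustrative, not existing Nutch APIs; the CrawlDatum handling follows the gone-status approach described above.

import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;

// Hypothetical extension point: decides whether a fetched page stays in the
// crawl (content kept, parsed, outlinks expanded) or is cut out of the graph.
public interface FetchFilter {
  /** Return true to keep the page and follow its outlinks. */
  boolean accept(String url, Content content);
}

// Sketch of how the Fetcher could apply it; the helper and its locals are
// illustrative, not actual Fetcher internals.
class FetchFilterUsage {
  CrawlDatum applyFilter(FetchFilter filter, String url,
                         Content content, CrawlDatum datum) {
    if (filter.accept(url, content)) {
      // Normal path: content/parse are kept, outlinks get queued by updatedb.
      datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
    } else {
      // Filtered out: the caller nulls content/parse downstream, and the
      // datum is marked gone so CrawlDb records it and stops rescheduling it.
      datum.setStatus(CrawlDatum.STATUS_FETCH_GONE);
    }
    return datum;
  }
}

A dedicated status (something like a hypothetical STATUS_FETCH_FILTERED) would make the intent clearer than reusing gone, but the flow through updatedb would be the same.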
Thoughts on how we might approach this?