Description
As far as I know, neither Nutch 1.x nor 2.x currently has a feature for extracting exact content from specific websites. For my current project I've developed a plugin, parse-jsoup, that uses Jsoup to extract precise content for site-specific crawling. It is driven by a detailed XML configuration (field name, CSS selector, attribute, extraction rules, data type, default value, etc.).
Please let me know whether this feature seems relevant and is not already present in Nutch. I also plan to port it to Nutch 1.x.
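
For illustration only, a per-site configuration of the kind described above might look roughly like the sketch below. The element names, attribute names, URL pattern, and selectors are all hypothetical, chosen only to show how a field name, CSS selector, attribute, extraction rule, data type, and default value could fit together. This is not the plugin's actual schema.

```xml
<!-- Hypothetical sketch: every element and attribute name here is illustrative only -->
<site url-pattern="^https?://www\.example\.com/products/.*">
  <!-- Take the text of the first matching element as a string -->
  <field name="title" selector="h1.product-title" type="string" default=""/>
  <!-- Strip everything except digits and dots before converting to a number -->
  <field name="price" selector="span.price" type="double" default="0.0">
    <rule regex="[^0-9.]" replace=""/>
  </field>
  <!-- Read an attribute value instead of the element text -->
  <field name="image" selector="img.main-photo" attribute="src" type="string"/>
</site>
```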