The latest SVN version already contains similar code (see parse-html/..../HTMLMetaProcessor.java). The only thing that is missing is to put the content meta tags into ParseData.metadata.
As you know, we actually have two places to put metadata into: one is Protocol.metadata, where all protocol-related metadata should be stored, and the other is ParseData.metadata, where parse-related metadata should be stored, which is the case here.
However, this may overwrite other properties coming from protocol handlers, or discovered by other plugins or other parts of the code. The "lang" tag is one such example; "content-encoding" and "charset" are others. The language identifier plugin works around this by using an "X-meta-lang" property name. (BTW: it could be cleaned up to avoid traversing the node tree once again, and instead make use of the already discovered meta tags, which are now passed as an argument to HtmlParseFilters.)
I suggest reworking this to use a consistent schema in both cases (i.e. Content.metadata and ParseData.metadata): let's put them under an "X-nutch-<name>" key (where <name> is e.g. the value of the key retrieved from HtmlMetaTags.getGeneralTags()), or an "X-nutch-http-equiv-<name>" key (where <name> is the value of the key retrieved from HtmlMetaTags.getHttpEquivTags()), and so on. So, this would be e.g. "X-nutch-robots", "X-nutch-base", "X-nutch-http-equiv-pragma", "X-nutch-http-equiv-refresh".
This way we can store all <meta> information, without any danger of overwriting the original values.
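To make the naming scheme concrete, here is a rough sketch. It deliberately uses plain java.util maps rather than the real Nutch classes, so MetaTagKeys and prefixed() are hypothetical names, and lower-casing the tag names is my assumption, not something the existing code does.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: it only illustrates the proposed key scheme.
final class MetaTagKeys {
    static final String GENERAL_PREFIX = "X-nutch-";
    static final String HTTP_EQUIV_PREFIX = "X-nutch-http-equiv-";

    private MetaTagKeys() {}

    /**
     * Copies general and http-equiv meta tags into one map under prefixed keys.
     * Tag names are lower-cased (an assumption) so "Refresh" and "refresh" map to one key.
     */
    static Map<String, String> prefixed(Map<String, String> generalTags,
                                        Map<String, String> httpEquivTags) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : generalTags.entrySet()) {
            out.put(GENERAL_PREFIX + e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        for (Map.Entry<String, String> e : httpEquivTags.entrySet()) {
            out.put(HTTP_EQUIV_PREFIX + e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> general = new LinkedHashMap<>();
        general.put("robots", "noindex,follow");

        Map<String, String> httpEquiv = new LinkedHashMap<>();
        httpEquiv.put("Refresh", "5; url=http://example.com/");

        // Stand-in for a metadata container that already holds a protocol-level value.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "text/html");

        // The prefixed keys are added alongside the original value, which stays intact.
        metadata.putAll(prefixed(general, httpEquiv));
        System.out.println(metadata);
    }
}
```

Since every meta-derived key carries the prefix, a page's own <meta> tags end up next to the protocol-level values instead of replacing them.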