I downloaded the HTML page suggested by Jayesh Shende on
TIKA-2382, and I've dumped the proposed output in the RecursiveParserWrapper format.
There are 10 metadata objects. The first contains the main page, and then there are 9 scripts.
I'm not sure what we should do with the src= info when a script relies on an external resource rather than inlining the code.
Dumb question: what other types besides JavaScript can we have? Should we have a mapping from type= to mime type that we can pass into the child's metadata?
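To make the type= question concrete, here is a rough sketch of what such a mapping might look like. ScriptTypeMapper and toMimeType are hypothetical names, not existing Tika classes, and the entries in the table (including the defaults for a missing or unknown type=) are only my guesses at a starting point.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: maps a <script type="..."> value
// to a mime type string that could be set on the child's metadata.
final class ScriptTypeMapper {

    // Assumption: a script with no type= is treated as JavaScript.
    static final String DEFAULT_TYPE = "application/javascript";

    // Assumption: anything we don't recognize falls back to octet-stream.
    static final String UNKNOWN_TYPE = "application/octet-stream";

    // Illustrative entries only; the real list would need review.
    private static final Map<String, String> TYPES = Map.of(
            "text/javascript", "application/javascript",
            "application/javascript", "application/javascript",
            "module", "application/javascript",
            "application/json", "application/json",
            "application/ld+json", "application/ld+json");

    static String toMimeType(String typeAttr) {
        if (typeAttr == null || typeAttr.trim().isEmpty()) {
            return DEFAULT_TYPE;
        }
        // Drop parameters such as "; charset=utf-8" and normalize case.
        String key = typeAttr.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(key, UNKNOWN_TYPE);
    }
}
```

With something like this, the parser could look up the type= value once per script element and set the result as the content type on that script's metadata object.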
For now, we're still ignoring <style> elements.
I'd want to require users to turn this behavior on via an HTMLParserConfig.
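As a sketch of the opt-in idea, assuming a simple bean-style config: HtmlParserConfigSketch and its extractScripts flag are hypothetical names for illustration, not an existing Tika API. The only point is that the default stays off, so current behavior doesn't change unless a user asks for it.

```java
// Hypothetical config object, not an existing Tika class.
final class HtmlParserConfigSketch {

    // Off by default: scripts are only extracted as embedded documents
    // when the user explicitly turns this on.
    private boolean extractScripts = false;

    boolean isExtractScripts() {
        return extractScripts;
    }

    void setExtractScripts(boolean extractScripts) {
        this.extractScripts = extractScripts;
    }
}
```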
Big question: what do you think? Any other areas for improvement?