Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
Whatever meta tags the input HTML contains, the meta tags in Tika's HTML output can only have a "name" and a "content" attribute. This produces invalid HTML meta tags in the output.
For instance, the following valid HTML file
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Title</title>
    <meta http-equiv="refresh" content="0; url=http://example.com">
  </head>
  <body></body>
</html>
is transformed into a SAX stream corresponding to the following HTML:
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="dc:title" content="Title"/>
    <meta name="Content-Encoding" content="ISO-8859-1"/>
    <meta name="refresh" content="0; url=http://example.com"/>
    <meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
    <title>Title</title>
  </head>
  <body/>
</html>
(The redirect, Content-Type, and Content-Encoding are all specified in a non-standard way, as "meta name=" tags.)
The information that the original file had an "http-equiv" meta tag is lost, and replaced by a generic "meta name=" tag.
This is a problem when working with classes that expect a valid meta redirect, such as Nutch's HTMLMetaProcessor.
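To illustrate the effect on a downstream consumer, here is a minimal sketch that uses only the JDK's SAX parser. It is not Nutch's HTMLMetaProcessor; the class name MetaRefreshCheck and the helper findRefresh are hypothetical names made up for this example. It feeds the output quoted above to a handler that looks for a refresh directive, first under "http-equiv" (what a standards-following consumer would check) and then under "name" (where Tika actually puts it).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MetaRefreshCheck {

    // XHTML equivalent of the SAX stream quoted in this report.
    static final String TIKA_OUTPUT =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
        + "<meta name=\"dc:title\" content=\"Title\"/>"
        + "<meta name=\"Content-Encoding\" content=\"ISO-8859-1\"/>"
        + "<meta name=\"refresh\" content=\"0; url=http://example.com\"/>"
        + "<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\"/>"
        + "<title>Title</title></head><body/></html>";

    /**
     * Hypothetical helper: returns the "content" value of every meta element
     * whose given attribute equals "refresh" (case-insensitive).
     */
    static List<String> findRefresh(String xhtml, String attribute) throws Exception {
        List<String> found = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // the output declares the XHTML namespace
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xhtml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // getValue returns null when the attribute is absent,
                    // and equalsIgnoreCase(null) is simply false.
                    if ("meta".equals(localName)
                            && "refresh".equalsIgnoreCase(atts.getValue(attribute))) {
                        found.add(atts.getValue("content"));
                    }
                }
            });
        return found;
    }

    public static void main(String[] args) throws Exception {
        // A consumer that checks http-equiv finds no redirect at all.
        System.out.println("http-equiv: " + findRefresh(TIKA_OUTPUT, "http-equiv"));
        // The redirect is only visible to code that knows to look under name.
        System.out.println("name: " + findRefresh(TIKA_OUTPUT, "name"));
    }
}
```

With the output quoted in this report, the "http-equiv" lookup comes back empty and only the "name" lookup returns the redirect target. Checking "name" as a fallback is a possible consumer-side workaround, but it cannot tell a real http-equiv directive apart from an ordinary meta tag that happens to be named "refresh".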