Description
The following HTML document:
<html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
is rendered by Tika as the following XHTML:
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
The lang attribute of the html element is lost in the output, and its value is not stored in the metadata either.
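To make the expected behaviour concrete, here is a minimal sketch in Python using only the standard library. It is not Tika code and does not use Tika's API. The class name `LangExtractor`, the helper `extract_metadata`, and the metadata key `Content-Language` are all assumptions chosen for illustration. It shows one way a parser could carry the root element's lang value into a metadata map instead of dropping it.

```python
from html.parser import HTMLParser


class LangExtractor(HTMLParser):
    """Hypothetical helper: records the lang attribute of the <html> element."""

    def __init__(self):
        super().__init__()
        # Stand-in for a parser's metadata object (assumed shape: a plain dict).
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        # Only the first <html> start tag is inspected in this sketch.
        if tag == "html" and "Content-Language" not in self.metadata:
            lang = dict(attrs).get("lang")
            if lang:
                # "Content-Language" is an assumed key name, not confirmed by the report.
                self.metadata["Content-Language"] = lang


def extract_metadata(html_text):
    """Parse html_text and return the collected metadata dict."""
    parser = LangExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.metadata


# The document from the description above.
doc = '<html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>'
metadata = extract_metadata(doc)
print(metadata)
```

With the document from this report, the sketch stores the value `fi` under the assumed key. A document without a lang attribute yields an empty metadata map.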
Attachments
Issue Links
- blocks
  - NUTCH-794 Language Identification must use check the parse metadata for language values (Closed)
  - NUTCH-817 parse-(html)does follow links of full html page, parse-(tika) does follow any links and stops at level 1 (Closed)
- is related to
  - TIKA-478 HtmlParser can emit <head> elements inside of <body> block (Resolved)