Details
- Bug
- Status: Open
- Minor
- Resolution: Unresolved
- 1.17
- None
- None
- None
- MacOS 10.13.2 JDK8
Description
Steps to reproduce:
- Using Safari, save any web page as a "webarchive"
- Use Tika to extract the archive content, as in the example below
Expected result:
Tika extracts the HTML contents from the webarchive.
Actual result:
Nothing is extracted, although the correct MIME type is identified.
try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
    TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
    // this looks for content anywhere in the page independently of orientation
    tesseractOCRConfig.setPageSegMode("11");

    ParseContext context = new ParseContext();
    context.set(Parser.class, tika.getParser());
    context.set(TesseractOCRConfig.class, tesseractOCRConfig);

    try (InputStream fd = Files.newInputStream(path)) {
        tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
    } catch (SAXException e) {
        throw new EngineError(e);
    }
}
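A note for triage: Safari webarchive files are typically stored as binary property lists, which is presumably why the binary plist parser issue is linked. The sketch below is a JDK-only check, not a Tika API, that tests whether a file starts with the `bplist00` magic bytes. The class and method names (`WebArchiveCheck`, `looksLikeBinaryPlist`) are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class WebArchiveCheck {
    // Binary property lists commonly start with the ASCII bytes "bplist00".
    private static final byte[] BPLIST_MAGIC =
            "bplist00".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given leading bytes begin with the bplist magic.
    public static boolean looksLikeBinaryPlist(byte[] header) {
        return header.length >= BPLIST_MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, BPLIST_MAGIC.length), BPLIST_MAGIC);
    }

    // Reads up to the first 8 bytes of the file and checks them.
    public static boolean looksLikeBinaryPlist(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[BPLIST_MAGIC.length];
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the magic
                }
                read += n;
            }
            return looksLikeBinaryPlist(Arrays.copyOf(header, read));
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: pass the path of a saved .webarchive file.
        Path file = Paths.get(args[0]);
        System.out.println(file + " binary plist: " + looksLikeBinaryPlist(file));
    }
}
```

If this reports `true` for the saved page, the file is a binary plist container, and extracting its HTML would require a parser that understands that format rather than a plain HTML parser.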
Attachments
Issue Links
- relates to: TIKA-2923 Add parser for binary plist (Open)