[TIKA-2543] No content extraction for application/x-webarchive format - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.17
Fix Version/s: None
Component/s: None
Labels: None
Environment: macOS 10.13.2, JDK 8

Description

Steps to reproduce:

Using Safari, save any web page as a "webarchive".
Use Tika to extract the archive content, as in the example below.

Expected result:
Tika should extract the HTML content from the webarchive.
Actual result:
Nothing is extracted, although the correct MIME type is identified.

    // tika, path, extractedContentPath and EngineError are defined elsewhere in the reporter's code
    try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
      TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();

      // page segmentation mode 11 (sparse text): find as much text as possible, in no particular order
      tesseractOCRConfig.setPageSegMode("11");

      ParseContext context = new ParseContext();
      context.set(Parser.class, tika.getParser());
      context.set(TesseractOCRConfig.class, tesseractOCRConfig);

      // write whatever text the parser extracts to the output file
      try (InputStream fd = Files.newInputStream(path)) {
        tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
      } catch (SAXException e) {
        throw new EngineError(e);
      }
    }

Attachments

Apache Tika – Configuring Tika.webarchive, 86 kB, 22/Jan/18 02:36, Rafael Ferreira
tika.plist, 124 kB, 17/Oct/18 16:14, Tim Allison

Issue Links

relates to: TIKA-2923 Add parser for binary plist (Open)
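
Safari writes .webarchive files as binary property lists, which is why this issue relates to TIKA-2923. As a rough sketch for triage, using only the JDK and not any Tika API, the following checks whether a file begins with the binary plist magic bytes ("bplist"). The class and method names (BinaryPlistCheck, hasBinaryPlistMagic, isBinaryPlist) are hypothetical and chosen for illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: detects the binary plist header.
class BinaryPlistCheck {
    // Binary property lists start with the ASCII bytes "bplist" followed by a version such as "00".
    private static final byte[] MAGIC = "bplist".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes begin with the binary plist magic.
    static boolean hasBinaryPlistMagic(byte[] header) {
        return header.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    // Reads just enough bytes from the file to compare against the magic (JDK 8 compatible).
    static boolean isBinaryPlist(Path file) throws IOException {
        byte[] buf = new byte[MAGIC.length];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) {
                    break; // file is shorter than the magic
                }
                n += r;
            }
        }
        return n == buf.length && hasBinaryPlistMagic(buf);
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println(p + ": " + (isBinaryPlist(p) ? "binary plist" : "not a binary plist"));
        }
    }
}
```

Running it against a page saved from Safari (for example `java BinaryPlistCheck page.webarchive`, where page.webarchive is a placeholder file name) shows whether the file is a binary plist container, in which case the HTML is embedded inside the plist rather than stored as plain text.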

People

Assignee: Unassigned

Reporter: Rafael Ferreira

Votes: 0

Watchers: 3

Dates

Created: 07/Jan/18 06:16

Updated: 29/Oct/19 15:52