TIKA-2543

No content extraction for application/x-webarchive format

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.17
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: macOS 10.13.2, JDK 8

    Description

      Steps to reproduce:

      1. Using Safari, save any web page as a "webarchive".
      2. Use Tika to extract the archive content, as in the example below.

      Expected result:
      Tika extracts the HTML contents from the webarchive.

      Actual result:
      Nothing is extracted, although the correct MIME type is identified.

      try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
        TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();

        // this looks for content anywhere in the page independently of orientation
        tesseractOCRConfig.setPageSegMode("11");

        ParseContext context = new ParseContext();
        context.set(Parser.class, tika.getParser());
        context.set(TesseractOCRConfig.class, tesseractOCRConfig);

        // Stream the file through the configured parser and write extracted text to the writer
        try (InputStream fd = Files.newInputStream(path)) {
          tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
        } catch (SAXException e) {
          throw new EngineError(e);
        }
      }
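
When reproducing this, it may help to first confirm that the input file really is what Safari wrote. The sketch below assumes that a Safari webarchive is stored as a binary property list, which begins with the ASCII bytes "bplist00" (the attached tika.plist hints at the same container format). It uses only the JDK and no Tika API; the class and method names are hypothetical and not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Apache Tika.
class WebArchiveCheck {

    // Assumption: binary property lists start with the ASCII bytes "bplist00".
    private static final byte[] BPLIST_MAGIC =
            "bplist00".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes begin with the binary plist magic. */
    public static boolean hasBinaryPlistHeader(byte[] header) {
        return header.length >= BPLIST_MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, BPLIST_MAGIC.length), BPLIST_MAGIC);
    }

    /** Reads up to the first 8 bytes of the file and checks them for the magic. */
    public static boolean looksLikeBinaryPlist(Path file) throws IOException {
        byte[] header = new byte[BPLIST_MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // A single read() may return fewer bytes than requested, so loop.
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // end of file reached before 8 bytes were available
                }
                read += n;
            }
        }
        return hasBinaryPlistHeader(Arrays.copyOf(header, read));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WebArchiveCheck <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(file + " has binary plist header: " + looksLikeBinaryPlist(file));
    }
}
```

If the check fails for a file saved by Safari, the assumption about the container format does not hold for that file, and the result says nothing about Tika itself.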
      

      Attachments

        1. tika.plist (124 kB, Tim Allison)
        2. Apache Tika – Configuring Tika.webarchive (86 kB, Rafael Ferreira)

            People

              Assignee: Unassigned
              Reporter: Rafael Ferreira (cleverfoo)
              Votes: 0
              Watchers: 3
