  1. Tika
  2. TIKA-3526

Can't extract content from attachments in Office docs created by WPS

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.20
    • None
    • None
    • None

    Description

      Office documents (doc, docx, xls, xlsx, ppt, pptx) can contain other documents as embedded attachments. Can the contents of those attachments be extracted for the combinations shown in the table below?

       

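        Attachment type \ Container | doc | docx | xls | xlsx | ppt | pptx
        txt                         |     |      |     |      |     |
        pdf                         |     |      |     |      |     |
        xml                         |     |      |     |      |     |
        doc                         |     |      |     |      |     |
        docx                        |     |      |     |      |     |
        xls                         |     |      |     |      |     |
        xlsx                        |     |      |     |      |     |
        ppt                         |     |      |     |      |     |
        pptx                        |     |      |     |      |     |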

       
      1. If we are using Tika incorrectly, please show us the correct approach. This is our current code:

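      File file = new File("XX");
      Parser parser = new OfficeParser();
      ParseContext context = new ParseContext();
      Metadata metadata = new Metadata();
      // Assumed handler: collects body text; -1 removes the default write limit
      ContentHandler handler = new BodyContentHandler(-1);

      metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
      metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
      // Open the container document and close the stream once parsing finishes
      try (InputStream inputStream = TikaInputStream.get(file.toPath())) {
          parser.parse(inputStream, handler, metadata, context);
      }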
      

       
      2. We use Tika 1.20. We have also tried upgrading to the latest version, 2.0, and the problem still exists.
       
      3. If this is indeed a gap in the current version, please consider addressing it in a future release.
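
      One way to narrow this down without involving Tika is to check where a container file stores its embedded attachments. The OOXML formats (docx, xlsx, pptx) are ZIP packages, and Microsoft Office conventionally places embedded objects under word/embeddings/, xl/embeddings/ or ppt/embeddings/. The sketch below lists the ZIP entries found under such a folder. The class and method names are hypothetical and not part of Tika, and whether WPS writes attachments to the same location is an assumption to verify, not something this issue confirms. The legacy doc, xls and ppt formats are OLE2 compound files rather than ZIP packages, so this check does not apply to them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika: lists candidate embedded-attachment
// parts inside an OOXML package (docx, xlsx, pptx), which is a ZIP archive.
class EmbeddedPartLister {

    // Returns the names of ZIP entries stored under an "embeddings/" folder,
    // e.g. word/embeddings/, xl/embeddings/ or ppt/embeddings/.
    static List<String> listEmbeddedParts(InputStream ooxml) throws IOException {
        List<String> parts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (!entry.isDirectory() && name.contains("/embeddings/")) {
                    parts.add(name);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EmbeddedPartLister <file.docx|file.xlsx|file.pptx>");
            return;
        }
        // Print one embedded part name per line for the given package.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String part : listEmbeddedParts(in)) {
                System.out.println(part);
            }
        }
    }
}
```

      Running this against "embedded attachment.docx" and against a comparable file saved by Microsoft Office would show whether the two producers store attachments under the same part names. An empty result for the WPS file would suggest the attachments live elsewhere in the package.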
       

      Attachments

        1. embedded attachment.doc
          218 kB
          matcha007
        2. embedded attachment.docx
          127 kB
          matcha007
        3. embedded attachment.ppt
          263 kB
          matcha007
        4. embedded attachment.pptx
          148 kB
          matcha007
        5. embedded attachment.xls
          225 kB
          matcha007
        6. embedded attachment.xlsx
          128 kB
          matcha007
        7. image-2021-12-03-11-04-38-478.png
          14 kB
          matcha007
        8. image-2021-12-03-11-05-51-182.png
          12 kB
          matcha007
        9. image-2021-12-03-11-06-44-697.png
          12 kB
          matcha007
        10. image-2021-12-03-11-07-33-659.png
          13 kB
          matcha007
        11. image-2021-12-03-11-11-29-649.png
          36 kB
          matcha007
        12. image-2021-12-03-11-15-51-328.png
          32 kB
          matcha007
        13. TIKA-3526.pptx
          55 kB
          Tim Allison

          People

            Assignee: Unassigned
            Reporter: matcha007
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated: