Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.20
Fix Version/s: None
Component/s: None
Labels: None
Description
Microsoft Office documents can contain other Office documents (as well as txt and xml files) as embedded attachments. Can the contents of these attachments be extracted for the container/attachment combinations shown in the table below?
attachment \ container | doc | docx | xls | xlsx | ppt | pptx
txt | | | | | |
xml | | | | | |
doc | | | | | |
docx | | | | | |
xls | | | | | |
xlsx | | | | | |
ppt | | | | | |
pptx | | | | | |
1. If we are using the API incorrectly, please show us the correct usage. This is our current code:
File file = new File("XX");
Parser parser = new OfficeParser();
ParseContext context = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
// inputStream and handler are created elsewhere (not shown here)
parser.parse(inputStream, handler, metadata, context);
2. We are using Tika 1.20. We also tried upgrading to the latest version, 2.0, and the problem still exists.
3. If the current version really does lack this capability, please consider adding it in a future release.
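For comparison with the snippet in item 1, here is a minimal sketch of one commonly used way to ask Tika to recurse into embedded documents. It rests on two assumptions that are not confirmed anywhere in this report: that the attachments are skipped because the `ParseContext` has no `Parser` registered, and that `OfficeParser` (which, as far as I know, targets the OLE2 formats doc, xls, and ppt) is not the right entry point for docx, xlsx, and pptx. The class name `EmbeddedTextExtractor` and the command-line argument are hypothetical, and the code needs the Tika parsers on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Tika.
public class EmbeddedTextExtractor {

    /** Returns the text of the container plus any embedded documents Tika can parse. */
    public static String extractText(InputStream in)
            throws IOException, SAXException, TikaException {
        // AutoDetectParser chooses a parser from the detected type, so one code
        // path covers doc, docx, xls, xlsx, ppt, and pptx containers.
        Parser parser = new AutoDetectParser();

        // Registering the parser in the context asks Tika to pass each embedded
        // document back to it; with an empty context, attachments may be skipped.
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        // -1 disables BodyContentHandler's default write limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        parser.parse(in, handler, metadata, context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Path path = Paths.get(args[0]); // hypothetical input file
        try (InputStream in = Files.newInputStream(path)) {
            System.out.println(extractText(in));
        }
    }
}
```

If the attachment text still does not appear with this setup, that would suggest a real gap in the parsers for the affected combinations rather than a usage problem.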