[TIKA-3526] cant extract content from attachments in Office docs created by WPS - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.20
Fix Version/s: None
Component/s: None
Labels: None

Description

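Office documents (doc, docx, xls, xlsx, ppt, pptx) can contain other files as embedded attachments, including other Office documents. Can the content of these attachments be extracted for each combination in the table below? Columns are the container format; rows are the attachment format.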

	doc	docx	xls	xlsx	ppt	pptx
txt
pdf
xml
doc
docx
xls
xlsx
ppt
pptx

1. If we are using Tika incorrectly, please show us the correct way. This is our current code:

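File file = new File("XX");
Parser parser = new OfficeParser();
ParseContext context = new ParseContext();
Metadata metadata = new Metadata();
// The report does not show how handler and inputStream are created;
// the two declarations below are assumptions added so the snippet is complete.
ContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit

metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
try (InputStream inputStream = new FileInputStream(file)) {
    parser.parse(inputStream, handler, metadata, context);
}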

2. We use Tika 1.20. We also tried upgrading to the latest version, 2.0, and the problem still exists.

3. If this is a genuine gap in the current version, please address it in a future release.
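As a diagnostic aid for the OOXML containers in the table (docx, xlsx, pptx), it may help to check whether the attachments are present in the file itself before looking at the parser. These formats are ZIP packages, and Microsoft Office typically stores embedded objects under `word/embeddings/`, `xl/embeddings/`, or `ppt/embeddings/`. Whether WPS uses the same layout is an assumption to verify. The sketch below uses only the JDK; the class and method names (`EmbeddingLister`, `listEmbeddedParts`, `sampleContainer`) are hypothetical, and it does not apply to the OLE2 formats (doc, xls, ppt).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika: lists ZIP entries that look like embedded parts.
final class EmbeddingLister {

    // Returns the names of non-directory entries located in an "embeddings/" folder,
    // e.g. word/embeddings/, xl/embeddings/ or ppt/embeddings/.
    static List<String> listEmbeddedParts(byte[] ooxmlBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(ooxmlBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!e.isDirectory() && e.getName().contains("/embeddings/")) {
                    names.add(e.getName());
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    // Builds a tiny stand-in ZIP (not a valid .docx) so the sketch runs without a real file.
    static byte[] sampleContainer() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            String[] entryNames = {"word/document.xml", "word/embeddings/oleObject1.bin"};
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("placeholder".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, inspect that file; otherwise use the in-memory sample.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : sampleContainer();
        listEmbeddedParts(data).forEach(System.out::println);
    }
}
```

If the list is empty for a WPS-created file that visibly contains an attachment, the attachment is probably stored somewhere other than the usual embeddings folder, which would be useful detail to add to this report.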

Attachments

embedded attachment.doc	18/Aug/21 05:38	218 kB	matcha007
embedded attachment.docx	18/Aug/21 05:38	127 kB	matcha007
embedded attachment.ppt	18/Aug/21 05:38	263 kB	matcha007
embedded attachment.pptx	18/Aug/21 05:38	148 kB	matcha007
embedded attachment.xls	18/Aug/21 05:38	225 kB	matcha007
embedded attachment.xlsx	18/Aug/21 05:39	128 kB	matcha007
image-2021-12-03-11-04-38-478.png	03/Dec/21 03:04	14 kB	matcha007
image-2021-12-03-11-05-51-182.png	03/Dec/21 03:05	12 kB	matcha007
image-2021-12-03-11-06-44-697.png	03/Dec/21 03:06	12 kB	matcha007
image-2021-12-03-11-07-33-659.png	03/Dec/21 03:07	13 kB	matcha007
image-2021-12-03-11-11-29-649.png	03/Dec/21 03:11	36 kB	matcha007
image-2021-12-03-11-15-51-328.png	03/Dec/21 03:15	32 kB	matcha007
TIKA-3526.pptx	17/Aug/21 10:22	55 kB	Tim Allison

People

Assignee: Unassigned

Reporter: matcha007

Votes: 0

Watchers: 3

Dates

Created: 17/Aug/21 08:53

Updated: 07/Dec/21 22:13