[TIKA-3545] TIKA PDF parsing issues - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Bug
Affects Version/s: 1.21
Fix Version/s: None
Component/s: parser, tika-server
Labels: None
Environment: Tested on DEV env

Description

I am using the tika-core 1.21 and tika-parsers 1.21 jar files as Tika dependencies in ManifoldCF 2.14 to crawl some files. Some of the PDF files are not being parsed correctly: strange characters appear in the extracted text.
I also tried swapping in the Tika 1.24 and 1.27 jar files, but that did not help (with 1.27, the files were not even extracted correctly).

I checked the document content itself, and it seems to be fine.
Can anybody help me with this?

An image showing the strange characters is attached for reference.
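
Since the garbling is shown only in a screenshot, one way to make "strange characters" concrete is to count suspicious code points in the extracted text. The following is a minimal sketch under stated assumptions: the class `SuspectChars` and the method `countSuspect` are hypothetical names, the code uses only the Java standard library (it is not part of Tika or ManifoldCF), and the three categories it counts are an assumption about what usually indicates a bad character mapping, not a confirmed diagnosis of this issue.

```java
import java.util.Map;
import java.util.TreeMap;

public class SuspectChars {

    // Counts code points that often indicate a broken text mapping:
    // U+FFFD, private-use code points, and control characters other
    // than tab, line feed, and carriage return.
    static Map<String, Integer> countSuspect(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        text.codePoints().forEach(cp -> {
            String kind = null;
            if (cp == 0xFFFD) {
                kind = "replacement";
            } else if (Character.getType(cp) == Character.PRIVATE_USE) {
                kind = "private-use";
            } else if (Character.isISOControl(cp)
                    && cp != '\n' && cp != '\r' && cp != '\t') {
                kind = "control";
            }
            if (kind != null) {
                counts.merge(kind, 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample string, not taken from the attached files.
        String sample = "Handleiding \uFFFD\uE001 koffiecorner\u0001";
        System.out.println(countSuspect(sample));
    }
}
```

Running the same check on the text extracted by each Tika version would show whether the versions differ in the kind of garbling or only in its amount.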

Attachments

Name | Date | Size | Author
365.jpg | 08/Sep/21 06:31 | 81 kB | Priya
Handleiding Digitale koffiecorner.pdf | 08/Sep/21 07:21 | 295 kB | Priya
KSF1.pdf | 08/Sep/21 07:21 | 64 kB | Priya
KSF1.txt | 08/Sep/21 15:51 | 1 kB | Tim Allison

People

Assignee: Unassigned
Reporter: Priya
Votes: 0
Watchers: 3

Dates

Created: 08/Sep/21 06:31
Updated: 01/Oct/21 03:37
Resolved: 01/Oct/21 03:37
