[PDFBOX-5868] PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.32, 3.0.3 PDFBox
Fix Version/s: 2.0.33, 3.0.4 PDFBox, 4.0.0
Component/s: Text extraction
Labels:
- ActualText
Environment:
Ubuntu 22.04.4 LTS x86_64

Description

I downloaded the latest executable JAR of PDFBox (3.0.3) for testing and used the export:text command-line tool to obtain the results.

multilingual_test.pdf is the original PDF I made to test multilingual text extraction.
pdfbox_out.txt is the text file produced by PDFBox.
adobe_out.txt is the text file created by Adobe Reader's "Save as Text" feature.

Observation:

As you can see in the attachments, the text file produced by PDFBox shows unexpected Unicode characters for Tamil and Bengali (for Hindi the characters are extracted but not overlapped; Japanese seems fine to me). In contrast, the text file obtained from Adobe Reader's "Save as Text" feature seems fine, and copy-pasting the text from my document viewer (Evince) also works.
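
One way to make "unexpected Unicode characters" concrete is to count which Unicode scripts actually occur in each output file and compare the counts for pdfbox_out.txt and adobe_out.txt. The following is only a minimal sketch using the Java standard library; the class name ScriptHistogram and the method countScripts are hypothetical names chosen for illustration, and the built-in sample string is not taken from the attachments.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic helper: not part of PDFBox or Tika.
public class ScriptHistogram {

    // Counts code points per Unicode script, ignoring whitespace.
    static Map<Character.UnicodeScript, Integer> countScripts(String text) {
        Map<Character.UnicodeScript, Integer> counts = new TreeMap<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // With a file argument, analyse that file as UTF-8; otherwise use a small sample.
        String text = args.length > 0
            ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
            : "\u0BA4\u0BAE\u0BBF abc";
        countScripts(text).forEach((script, n) -> System.out.println(script + ": " + n));
    }
}
```

If the Tamil or Bengali counts are much lower in one file than in the other, or code points show up under scripts such as UNKNOWN, that would suggest the glyph-to-Unicode mapping went wrong during extraction rather than during display.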

Questions:

Why are the outputs from PDFBox and Adobe different?
What can I do to extract the text from a multilingual PDF correctly?
Is there a way to apply pattern matching to the text in a PDF file and report matches without extracting the text first (say, if the problem is with fonts and glyphs)?

—

My use case, FYI:

I am trying to extract text from files and run pattern matching on it. I am using Apache Tika to parse documents. I noticed a problem with the extracted PDF text (other file types parse fine). I used the executable PDFBox JAR to conclude that the problem is in PDFBox and not in Tika, and tested with Adobe Reader's text extraction to confirm that the problem is not with the PDF. I want to extract this multilingual text only to run pattern matching on it; I do not need to display the content, only to know whether the pattern is present or not (say, if the problem is with fonts and glyphs).
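
The "is the pattern present or not" check described above could look roughly like the sketch below, which runs on text that has already been extracted. It assumes (this is an assumption, not something stated in the report) that both the text and the pattern should be normalized to NFC first, so that canonically equivalent sequences compare equal. The class name PatternPresence and the method containsPattern are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper for presence-only matching on already extracted text.
public class PatternPresence {

    // Returns true if the regex matches anywhere in the text.
    // Both inputs are normalized to NFC (an assumption of this sketch).
    static boolean containsPattern(String extractedText, String regex) {
        String text = Normalizer.normalize(extractedText, Normalizer.Form.NFC);
        String normalizedRegex = Normalizer.normalize(regex, Normalizer.Form.NFC);
        // UNICODE_CHARACTER_CLASS makes \w and \d cover non-Latin letters and digits.
        Pattern pattern = Pattern.compile(normalizedRegex, Pattern.UNICODE_CHARACTER_CLASS);
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        // Tamil "ko" written with a split vowel sign versus the precomposed sign.
        String decomposed = "\u0B95\u0BC6\u0BBE";
        String composed = "\u0B95\u0BCA";
        System.out.println(containsPattern(decomposed, composed));
    }
}
```

Normalization only helps when the extracted code points are correct but encoded in a different canonical form. It cannot repair text whose code points were extracted incorrectly in the first place, so it complements a fix in the extraction step rather than replacing one.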

Attachments

- multilingual_test.pdf (14/Aug/24 07:04, 85 kB, Manish S N)
- adobe_out.txt (14/Aug/24 07:04, 4 kB, Manish S N)
- pdfbox_out.txt (14/Aug/24 07:04, 4 kB, Manish S N)
- screenshot-1.png (14/Aug/24 10:28, 73 kB, Tilman Hausherr)
- okular_out.txt (16/Aug/24 06:51, 4 kB, Manish S N)
- poppler_out.txt (16/Aug/24 07:07, 4 kB, Manish S N)
- Main.java (16/Aug/24 13:21, 0.7 kB, Manish S N)
- Tilman's_solution_out.txt (16/Aug/24 13:21, 4 kB, Manish S N)
- screenshot-2.png (16/Aug/24 19:00, 19 kB, Tilman Hausherr)
- suppressDuplicateOverlapping_out.txt (17/Aug/24 07:46, 4 kB, Manish S N)
- PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf (18/Aug/24 17:54, 9 kB, Tilman Hausherr)
- PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf (18/Aug/24 17:54, 315 kB, Tilman Hausherr)
- PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf (18/Aug/24 17:54, 123 kB, Tilman Hausherr)
- image-2024-08-19-10-38-13-472.png (19/Aug/24 05:08, 10 kB, Manish S N)
- EmptyActualText_poppler.txt (19/Aug/24 05:10, 2 kB, Manish S N)
- EmptyActualText_reduced_poppler.txt (19/Aug/24 05:10, 0.0 kB, Manish S N)
- content_diffs_with_exceptions-ActualText.xlsx (19/Aug/24 07:35, 1.62 MB, Tilman Hausherr)
- page.pdf (30/Aug/24 11:38, 124 kB, Manish S N)
- image-2024-08-30-17-55-41-423.png (30/Aug/24 12:25, 6 kB, Manish S N)

Issue Links

duplicates
- PDFBOX-3248 Unwanted spaces in text extraction (2) [Closed]

is duplicated by
- PDFBOX-4532 PDFTextStripper replacing the decimal with white space [Closed]
- TIKA-4231 Parsing Arabic PDF is returning bad data [Closed]

People

Assignee: Tilman Hausherr

Reporter: Manish S N

Votes: 0

Watchers: 6

Dates

Created: 14/Aug/24 07:33

Updated: 30/Aug/24 12:43