[TIKA-713] Tika can not parse all of the persian pdf files - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9
Fix Version/s: None
Component/s: parser
Labels: None

Description

Hello,
I used Tika (through Nutch) to parse some Persian PDF files. Some of the files were converted to plain text correctly, but for others the output was corrupted. I used the ICU4J v4 library, which switched the text to right-to-left mode, but that did not resolve the problem: Tika cannot recognize any character of the input Persian PDF file!

I copied this text from my PDF file via Document Viewer on Linux; it is clearly Persian text:
--------------------------
‫هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان" بخواند.‬
‫) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف صفحه است. (‬
‫همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت حافظه مفيد است:‬
‫1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي‬
‫4- خوردن عسل‬ ‫5- خوردن عدس 6- خوردن گوشت نزديک گردن
--------------------------
Tika returns this output:
--------------------------
92 @A 8 * B
C9D !D ) =/
>

(<) , 8 ;
8 #

+ 9!:
L
#) 4 M() * 0>

-3 IA J

2 (+ G
H -1
(+ J 5#C 0T J ( O - 6 R . (+ O - 5 PH. (+ O -4
--------------------------

Thanks a lot

Attachments

- Simple3.pdf, 31/Oct/11 16:35, 367 kB, Ahmad Ajiloo
- Simple2.pdf, 31/Oct/11 13:13, 160 kB, Ahmad Ajiloo
- ebrat.pdf, 13/Sep/11 06:01, 73 kB, Ahmad Ajiloo
- Complex.pdf, 31/Oct/11 16:35, 266 kB, Ahmad Ajiloo

Issue Links

depends upon

PDFBOX-1127 PDF supplies glyph->unicode mapping, but PDFBox doesn't use it.

Closed

is related to

TIKA-1337 LanguageProfile for Persian/Farsi

Resolved

Activity

Ahmad Ajiloo added a comment - 13/Sep/11 06:01

This is a Persian PDF file that Tika cannot parse.

Robert Muir added a comment - 13/Sep/11 06:20

Thanks Ahmad... I took a look at this PDF and I suspect this is the problem:

The fonts contained in the document have custom font encodings. I opened them in FontForge and, for example, Arabic alef maps to U+0006.
So that's why you see the garbage; it's actually unrelated to ICU or the bidirectional algorithm.

I think the reason copy/paste works fine in this document is that it probably has Unicode PDF metadata... maybe PDFBox doesn't support this?

Disclaimer: I didn't look at any pdfbox code yet or really try to debug it.
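
The diagnosis above can be illustrated with a small, self-contained sketch. It is not PDFBox code. It only assumes that the page content stores arbitrary character codes (alef as 0x06, as observed in FontForge) and that the PDF may carry a separate code-to-Unicode table. The codes for beh and peh, the class name, and the method names are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class CustomEncodingSketch {
    // Character codes as they might appear in the page content.
    // Only alef = 0x06 comes from the report; 0x07 and 0x08 are made up.
    static final int[] PAGE_CODES = {0x06, 0x07, 0x08};

    // Hypothetical code-to-Unicode table supplied by the PDF.
    static Map<Integer, String> codeToUnicode() {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x06, "\u0627"); // ARABIC LETTER ALEF
        table.put(0x07, "\u0628"); // ARABIC LETTER BEH (assumed code)
        table.put(0x08, "\u067E"); // ARABIC LETTER PEH (assumed code)
        return table;
    }

    // Naive extraction: treat every character code as a Unicode code point.
    static String decodeNaively(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    // Mapping-aware extraction: use the table, fall back to the raw code.
    static String decodeWithTable(int[] codes, Map<Integer, String> table) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String mapped = table.get(code);
            if (mapped != null) {
                sb.append(mapped);
            } else {
                sb.appendCodePoint(code);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints control characters (garbage) for the naive path.
        System.out.println(decodeNaively(PAGE_CODES));
        // Prints the intended Arabic-script letters.
        System.out.println(decodeWithTable(PAGE_CODES, codeToUnicode()));
    }
}
```

If this picture is right, a viewer that consults the table gets readable text while an extractor that ignores it gets control characters, which would match the copy/paste versus Tika difference described in the report.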

Robert Muir added a comment - 02/Oct/11 20:17

I created PDFBOX-1127 for this, with some screenshots and a description of what is going on.

Robert Muir added a comment - 03/Oct/11 15:43

This is now fixed in PDFBox's trunk. When Tika upgrades to PDFBox 1.7.0, I can attach a test.

Ahmad Ajiloo added a comment - 05/Oct/11 19:14

Thanks a lot

Ahmad Ajiloo added a comment - 31/Oct/11 13:16

I'm testing the new Encoding.java file with other Persian PDF files. There is a new file, named Simple2.pdf, that PDFBox cannot parse. Please find it attached.
Thanks

Robert Muir added a comment - 31/Oct/11 13:24

Thanks for uploading another test file Ahmad, we'll take a look!

Ahmad Ajiloo added a comment - 31/Oct/11 16:35

I attached these two files for further investigation. Thanks for your attention.

Robert Muir added a comment - 31/Oct/11 19:34

Thanks Ahmad, I took a quick glance (not a thorough inspection yet):

Complex.pdf: Should work; I am able to copy/paste the text from Acrobat.
Simple3.pdf: Acrobat copy/paste yields the wrong Persian characters. Could be a bug in the font.
Simple2.pdf: This one might be hopeless. Acrobat copy/paste yields trash; I think it is a totally custom font encoding.

I will look in more depth later.
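
A rough way to automate this kind of first-pass triage is to check how much of the extracted text is actually in the Arabic script. The sketch below is only a heuristic under that assumption. The class and method names are hypothetical, and it uses nothing beyond the standard `Character.UnicodeScript` API. It would flag output like the garbage in the description, but it cannot detect a case like Simple3.pdf, where the characters are Persian but wrong.

```java
public class ScriptTriageSketch {
    // Fraction of letters in the text that belong to the Arabic script
    // (which also covers Persian letters). Returns 0.0 if there are no letters.
    static double arabicLetterRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace, controls
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    public static void main(String[] args) {
        String expected = "\u062E\u0648\u0631\u062F\u0646 \u0639\u0633\u0644"; // "خوردن عسل"
        String garbage = "92 @A 8 * B";                                        // from the report
        System.out.println(arabicLetterRatio(expected));
        System.out.println(arabicLetterRatio(garbage));
    }
}
```

Any cut-off for calling a result "garbage" would have to be tuned on real documents, since legitimate Persian text can contain Latin words.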

Ali Majdzadeh Kohbanani added a comment - 17/Oct/12 20:43

Ahmad,
Could you please explain how Complex.pdf was generated? Which tool was used to create the file, which fonts, and was there any specific configuration? I tested PDFBox on Complex.pdf and text extraction works very well. By contrast, every other PDF file I have tested with PDFBox produces many errors. I tried creating PDF files with PDFCreator and with the "Save as PDF" plugin in MS Word. In the first case, the extracted text contains only junk characters; in the second, some glyphs and ligatures are extracted incorrectly. I have filed a bug report for PDFBox, but in order to test PDFBox further I would like to know more about the method used to create Complex.pdf. Thanks a lot.

Shayan Tabrizi added a comment - 13/Mar/13 19:33

As far as I know, extracting Persian text from PDFs is inherently complicated. For example, text selected in Foxit Reader and other PDF readers is corrupted in most cases. The only reader I have used that overcomes this problem is Adobe Acrobat, but I don't know exactly what the source of the problem is. Solving this problem is very important for the Persian community; I see many people looking for a solution.

Robert Muir added a comment - 14/Mar/13 13:59

Even Acrobat cannot extract the text from Simple2.pdf: it's a custom font encoding.

Shayan Tabrizi added a comment - 14/Mar/13 14:11

Adobe Acrobat is not a magician. It probably cannot handle custom font encodings. But it can at least handle many normal PDFs.

Omid Pourhadi added a comment - 16/Jun/14 06:54

Hi,
Since you used the Microsoft Word PDF converter, I cannot extract the fonts from your PDF. Can you tell me which Persian font you used?

People

Assignee: Unassigned

Reporter: Ahmad Ajiloo

Votes: 2

Watchers: 3

Dates

Created: 13/Sep/11 05:58

Updated: 01/Mar/15 22:47

Resolved: 01/Mar/15 22:47