Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.9
- None
- None
Description
Hello,
I used Tika (through Nutch, of course) to parse some Persian PDF files. Some of the files were converted cleanly to plain text, but for others the output was corrupted. I used the ICU4J v4 library, and the text changed to right-to-left mode, but that did not resolve the problem: Tika cannot recognize any character of the input Persian PDF file!
I copied this text from my PDF file via Document Viewer in Linux; it is clearly Persian text:
--------------------------
هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان" بخواند.
) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف صفحه است. (
همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت حافظه مفيد است:
1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي
4- خوردن عسل 5- خوردن عدس 6- خوردن گوشت نزديک گردن
--------------------------
Tika returns this output:
--------------------------
92 @A 8 * B
C9D !D ) =/
>(<) , 8 ;
8 #+ 9!:
L
#) 4 M() * 0>
- -3 IA J
- 2 (+ G
H -1
(+ J 5#C 0T J ( O - 6 R . (+ O - 5 PH. (+ O -4
--------------------------
Thanks a lot.
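A quick way to flag this kind of corrupted extraction automatically is to measure how many of the extracted letters are actually in the Arabic script, which Persian uses. The following is only a minimal sketch, not part of Tika or Nutch: the function name `arabic_script_ratio` is hypothetical, and the two sample strings are short excerpts of the expected and returned text quoted above.

```python
import unicodedata


def arabic_script_ratio(text: str) -> float:
    """Return the fraction of letters in `text` that are Arabic-script letters.

    Hypothetical helper for illustration; it relies on Unicode character
    names, which begin with "ARABIC" for Persian letters as well.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")
    )
    return arabic / len(letters)


# Excerpt of the text copied from the PDF viewer (expected output).
expected = "هر روز پس از نماز صبح"
# Excerpt of the text returned by the parser (corrupted output).
garbled = "92 @A 8 * B C9D !D ) =/"

print(arabic_script_ratio(expected))
print(arabic_script_ratio(garbled))
```

A ratio near zero for a document known to be Persian suggests that the glyph-to-Unicode mapping was lost during extraction, rather than that the text direction is wrong. Any cutoff used to separate good from bad output would be an assumption to tune per corpus.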
Attachments
Issue Links
- depends upon
  - PDFBOX-1127 PDF supplies glyph->unicode mapping, but PDFBox doesn't use it. (Closed)
- is related to
  - TIKA-1337 LanguageProfile for Persian/Farsi (Resolved)
This is a Persian PDF file that Tika cannot parse.