Tika / TIKA-713

Tika cannot parse all Persian PDF files

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      Hello
      I used Tika (within Nutch) to parse some Persian PDF files. Some of the files were converted cleanly to plain text, but for others the output was corrupted. I tried the ICU4J v4 library and the text changed to right-to-left mode, but that did not resolve the problem: Tika cannot recognize any character of the input Persian PDF file!

      I copied this text from my PDF file via Document Viewer in Linux. It is clearly Persian text:
      --------------------------
      ‫هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان" بخواند.‬
      ‫) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف صفحه است. (‬
      ‫همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت حافظه مفيد است:‬
      ‫1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي‬
      ‫4- خوردن عسل‬ ‫5- خوردن عدس 6- خوردن گوشت نزديک گردن
      --------------------------
      Tika returns this output:
      --------------------------
      92 @A 8 * B
      C9D !D ) =/
      >

      (<) , 8 ;
      8 #

      + 9!:
      L
      #) 4 M() * 0>

      • -3 IA J
      • 2 (+ G
        H -1
        (+ J 5#C 0T J ( O - 6 R . (+ O - 5 PH. (+ O -4
        --------------------------

      thanks a lot

      Attachments

        1. Simple3.pdf
          367 kB
          Ahmad Ajiloo
        2. Simple2.pdf
          160 kB
          Ahmad Ajiloo
        3. ebrat.pdf
          73 kB
          Ahmad Ajiloo
        4. Complex.pdf
          266 kB
          Ahmad Ajiloo

          Activity

            ahmad_aj Ahmad Ajiloo added a comment -

            This is a Persian PDF file that Tika can't parse.

            rcmuir Robert Muir added a comment -

            Thanks Ahmad... I took a look at this PDF and I suspect this is the problem:

            The fonts contained in the document have custom font encodings. I opened them up in FontForge and, for example, Arabic alef maps to U+0006.
            So that's why you see the garbage; it's actually unrelated to ICU or the bidirectional algorithm.

            I think the reason copy/paste works fine in this document is that it probably has Unicode PDF metadata... maybe PDFBox doesn't support this?

            Disclaimer: I didn't look at any pdfbox code yet or really try to debug it.
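
The mapping problem described in this comment can be illustrated with a small, self-contained sketch. It is not PDFBox code. The class `FontEncodingSketch`, its methods, and the one-entry table are hypothetical; only the alef-to-0x06 mapping comes from the comment above. The sketch assumes that a naive extractor treats each font character code as a Unicode code point, and that a correct one consults a code-to-Unicode table when the PDF provides one.

```java
import java.util.HashMap;
import java.util.Map;

public class FontEncodingSketch {
    // Hypothetical code-to-Unicode table for one embedded font.
    // Only this single entry is taken from the report: code 0x06 is drawn as alef.
    static final Map<Integer, Integer> TO_UNICODE = new HashMap<>();
    static {
        TO_UNICODE.put(0x06, 0x0627); // U+0627 ARABIC LETTER ALEF
    }

    // Naive extraction: reinterpret each font character code as a Unicode code point.
    static String decodeNaively(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    // Mapped extraction: translate each code through the table,
    // falling back to U+FFFD (replacement character) for unknown codes.
    static String decodeWithMap(int[] codes, Map<Integer, Integer> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(toUnicode.getOrDefault(code, 0xFFFD));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {0x06}; // the bytes stored in the PDF content stream
        System.out.printf("naive:  U+%04X%n", (int) decodeNaively(codes).charAt(0));
        System.out.printf("mapped: U+%04X%n", (int) decodeWithMap(codes, TO_UNICODE).charAt(0));
    }
}
```

Under these assumptions the naive path yields the control character U+0006 and the mapped path yields U+0627, which matches the observation that reordering text with a bidi library cannot repair the output: the code points themselves are wrong before any reordering happens.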

            rcmuir Robert Muir added a comment -

            I created PDFBOX-1127 for this with some screenshots and description of what is going on.

            rcmuir Robert Muir added a comment -

            This is now fixed in PDFBox's trunk. When Tika upgrades to 1.7.0 I can attach a test.

            ahmad_aj Ahmad Ajiloo added a comment -

            Thanks a lot

            ahmad_aj Ahmad Ajiloo added a comment -

            I'm testing the new Encoding.java file with other Persian PDF files. There is a new file named Simple2.pdf that PDFBox cannot parse. Please find it attached.
            Thanks

            rcmuir Robert Muir added a comment -

            Thanks for uploading another test file Ahmad, we'll take a look!

            ahmad_aj Ahmad Ajiloo added a comment -

            I attached these two files for further investigation. Thanks for your attention.

            rcmuir Robert Muir added a comment -

            Thanks Ahmad, I took a quick glance (not a thorough inspection yet):

            • Complex.pdf should work, I am able to copy/paste the text from Acrobat
            • Simple3.pdf: Acrobat copy/paste yields the wrong Persian characters. Could be a bug in the font.
            • Simple2.pdf: This one might be hopeless. Acrobat copy/paste yields trash, I think it is a totally custom font encoding.

            I will look in more depth later.
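
The manual copy/paste triage above could be approximated in code with a rough heuristic: measure what share of the letters in the extracted text belong to the Arabic script, which Persian uses. This is only a sketch under stated assumptions, not part of Tika or PDFBox. The class `ExtractionCheckSketch` and the method `arabicLetterRatio` are hypothetical names, and the heuristic cannot detect the Simple3.pdf case, where the output is Persian but the wrong characters.

```java
public class ExtractionCheckSketch {
    // Returns the fraction of letters in the text that belong to the Arabic script.
    // Digits, punctuation, and whitespace are ignored; text without letters scores 0.0.
    static double arabicLetterRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                letters++;
                if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                    arabic++;
                }
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    public static void main(String[] args) {
        // Two words from the reporter's expected text, written as Unicode escapes.
        String expected = "\u062E\u0648\u0631\u062F\u0646 \u0639\u0633\u0644";
        // First line of the corrupted output quoted in the report.
        String corrupted = "92 @A 8 * B";
        System.out.println(arabicLetterRatio(expected));
        System.out.println(arabicLetterRatio(corrupted));
    }
}
```

A ratio near zero for a document known to be Persian suggests the custom-encoding failure seen with the first attachment and Simple2.pdf, while a high ratio only shows that Arabic-script letters came out, not that they are the right ones.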


            majdzadeh Ali Majdzadeh Kohbanani added a comment -

            Ahmad,
            Could you please explain how Complex.pdf was generated? Which tool was used to create the file? Which fonts? Was there any specific configuration? I have tested PDFBox on Complex.pdf and it extracts the text very well. By contrast, every other PDF file I have tested for text extraction with PDFBox has lots of errors. I have tried creating PDF files with PDFCreator and with the "Save as PDF" plugin in MS Word. In the first case, the extracted text contains only junk characters; in the latter, some glyphs and ligatures are extracted wrongly. I have filed a bug report for PDFBox, but in order to test PDFBox further I would like to know more about the method used to create Complex.pdf. Thanks a lot.
            shayantabrizi Shayan Tabrizi added a comment -

            As far as I know, extracting Persian text from PDFs is inherently complicated. For example, text selected in Foxit Reader and other PDF readers is corrupted in most cases. The only reader I have used that can overcome this problem is Adobe Acrobat, but I don't know exactly what the source of the problem is. Solving it is very important for the Persian community; I see many people looking for a solution.

            rcmuir Robert Muir added a comment -

            Even Acrobat cannot extract the text from Simple2.pdf: it's a custom font encoding.

            shayantabrizi Shayan Tabrizi added a comment -

            Adobe Acrobat is not a magician. It probably cannot handle custom font encodings, but at least it can handle many normal PDFs.

            omidp Omid Pourhadi added a comment -

            Hi,
            Since you used the Microsoft Word PDF converter, I cannot extract the fonts from your PDF. Can you tell me which Persian font you used?


            People

              Assignee: Unassigned
              Reporter: Ahmad Ajiloo
              Votes: 2
              Watchers: 3
