TIKA-713

Tika cannot parse all Persian PDF files


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      Hello
      I used Tika (within Nutch, of course) to parse some Persian PDF files. Some of the files were converted to plain text correctly, but for others the output was corrupted. I used the ICU4J v4 library and the text changed to right-to-left mode, but that did not resolve the problem: Tika cannot recognize any character of the input Persian PDF file!

      I copied this text from my PDF file using Document Viewer on Linux; it is clearly Persian text:
      --------------------------
      ‫هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان" بخواند.‬
      ‫) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف صفحه است. (‬
      ‫همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت حافظه مفيد است:‬
      ‫1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي‬
      ‫4- خوردن عسل‬ ‫5- خوردن عدس 6- خوردن گوشت نزديک گردن
      --------------------------
      Tika returns this output:
      --------------------------
      92 @A 8 * B
      C9D !D ) =/
      >

      (<) , 8 ;
      8 #

      + 9!:
      L
      #) 4 M() * 0>

      • -3 IA J
      • 2 (+ G
        H -1
        (+ J 5#C 0T J ( O - 6 R . (+ O - 5 PH. (+ O -4
        --------------------------
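One quick way to flag extractions that are corrupted like the output above is to measure what share of the letters in the extracted text belong to the Arabic script, which Persian uses. The following is a minimal sketch that relies only on the Java standard library and does not call Tika itself; the class name `ScriptCheck` and the method `arabicLetterRatio` are hypothetical names chosen for illustration.

```java
final class ScriptCheck {
    private ScriptCheck() {}

    /**
     * Returns the fraction of letters in the text that belong to the
     * Arabic script, or 0.0 if the text contains no letters at all.
     */
    static double arabicLetterRatio(String text) {
        // Count every code point that Unicode classifies as a letter.
        long letters = text.codePoints()
                .filter(Character::isLetter)
                .count();
        if (letters == 0) {
            return 0.0;
        }
        // Count the letters whose Unicode script is ARABIC.
        long arabic = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC)
                .count();
        return (double) arabic / letters;
    }

    public static void main(String[] args) {
        // A short fragment of the expected Persian text, written with
        // Unicode escapes so the source file encoding does not matter.
        String expected = "\u062E\u0648\u0631\u062F\u0646 \u0639\u0633\u0644";
        // One line of the corrupted output reported in this issue.
        String extracted = "C9D !D ) =/";

        System.out.println("expected:  " + arabicLetterRatio(expected));
        System.out.println("extracted: " + arabicLetterRatio(extracted));
    }
}
```

A ratio near zero for a document known to be Persian suggests that the extracted characters were not mapped to Unicode correctly. Any threshold used to decide this would be an assumption to tune per corpus.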

      Thanks a lot.

      Attachments

        1. Complex.pdf
          266 kB
          Ahmad Ajiloo
        2. ebrat.pdf
          73 kB
          Ahmad Ajiloo
        3. Simple2.pdf
          160 kB
          Ahmad Ajiloo
        4. Simple3.pdf
          367 kB
          Ahmad Ajiloo

            People

              Assignee: Unassigned
              Reporter: Ahmad Ajiloo (ahmad_aj)
              Votes: 2
              Watchers: 3
