Tika / TIKA-469

The Parser is not correctly outputting Arabic text documents

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.7
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: Windows XP

    Description

      The parser does not preserve the character encoding when parsing Arabic UTF-8 documents, specifically .pdf and .doc files. The resulting output is either undecipherable or consists only of question marks.
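
The question-mark symptom can be illustrated without Tika. The sketch below uses only the JDK and rests on an assumption this report does not confirm: that somewhere in the pipeline the extracted text is encoded with a charset that has no Arabic characters, such as a Western single-byte default on Windows XP. The class name, method name, and sample string are hypothetical and chosen for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingLossDemo {

    // Encode text to bytes with the given charset, then decode it again.
    // Characters the charset cannot represent are replaced on encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: five Arabic letters, written as escapes so the
        // source file's own encoding cannot affect the result.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";

        // UTF-8 can represent every character, so the text survives.
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic));

        // ISO-8859-1 has no Arabic letters; each one becomes '?'.
        System.out.println(roundTrip(arabic, StandardCharsets.ISO_8859_1).equals("?????"));
    }
}
```

If this is the cause, writing the parser output through a writer with an explicit UTF-8 charset, rather than the platform default, would be the first thing to check. Undecipherable (rather than question-mark) output usually points to the opposite problem, bytes decoded with the wrong charset, which this sketch does not cover.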

          People

            Assignee: Unassigned
            Reporter: Robert Cullen (rcullen)
            Votes: 1
            Watchers: 1

            Dates

              Created:
              Updated: