Tika / TIKA-4305

Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.9.2
    • Fix Version/s: None
    • Component/s: tika-app, tika-core
    • Labels: None
    • Environment: Ubuntu 22.04 LTS

    Description

      Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.

      No logs or errors are emitted, just an empty string.
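
For context, here is a minimal Java sketch (not part of the original report) of how the same text is laid out on disk in these encodings. It uses UTF-16BE as a stand-in for UCS-2 (the two are identical for BMP characters) and UTF-32BE for UCS-4. The byte order and the absence of a BOM are assumptions, since the report does not state them, and `EncodingFootprint` is a hypothetical class name. It only illustrates the byte layout and is not a diagnosis of why Tika returns an empty string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingFootprint {
    // Encodes the text in the given charset and returns the raw bytes.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Counts 0x00 bytes in the encoded form.
    static int countZeroBytes(byte[] bytes) {
        int zeros = 0;
        for (byte b : bytes) {
            if (b == 0) {
                zeros++;
            }
        }
        return zeros;
    }

    public static void main(String[] args) {
        String sample = "abc";
        // Assumption: the running JDK provides UTF-32BE (it is not in StandardCharsets).
        Charset utf32be = Charset.forName("UTF-32BE");
        Charset[] charsets = {StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, utf32be};
        for (Charset cs : charsets) {
            byte[] bytes = encode(sample, cs);
            System.out.println(cs.name() + ": " + bytes.length + " bytes, "
                    + countZeroBytes(bytes) + " zero bytes");
        }
    }
}
```

ASCII-range text doubles in size under the two-byte encoding and quadruples under the four-byte one, with zero bytes filling the unused positions. The 10 kB and 20 kB attachment sizes are consistent with that ratio.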

      Other formats are okay, except that UTF-7 files are parsed as UTF-8, which wreaks havoc with non-ASCII characters.

      How do I know that? I opened the UTF-7 file as UTF-8 using the open dialog of gedit and found the output similar to Tika's.
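
The gedit check can be mirrored with a small Java sketch, assuming a hand-written UTF-7 sample rather than the attached file. A UTF-7 stream consists only of ASCII bytes, so decoding it as UTF-8 never fails. It simply leaves the escape runs (such as `+AOk-` for "é") in the text as literal characters. `Utf7AsUtf8Demo` and `UTF7_SAMPLE` are hypothetical names. `StandardCharsets` has no UTF-7 entry, so the sample is written out by hand.

```java
import java.nio.charset.StandardCharsets;

public class Utf7AsUtf8Demo {
    // UTF-7 form of "café": é (U+00E9) is written as the modified-base64 run "+AOk-".
    static final String UTF7_SAMPLE = "caf+AOk-";

    // Returns true when every byte is below 0x80, i.e. plain ASCII.
    static boolean isPureAscii(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    // Decodes the bytes as UTF-8, which is what treating the file as UTF-8 amounts to.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] onDisk = UTF7_SAMPLE.getBytes(StandardCharsets.US_ASCII);
        System.out.println("pure ASCII: " + isPureAscii(onDisk));
        System.out.println("decoded as UTF-8: " + decodeAsUtf8(onDisk));
    }
}
```

Because the bytes are valid UTF-8 as they stand, a decoder gets no error signal that the encoding is wrong. The only clue is the presence of `+...-` runs in the decoded text.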

       

      I am attaching all four encoded files, along with Tika's output from parsing the UTF-7 file, for reference.

      Attachments

        1. multilingual_test_new_UCS-2.txt (10 kB, Manish S N)
        2. multilingual_test_new_UCS-4.txt (20 kB, Manish S N)
        3. multilingual_test_new_UTF-7.txt (9 kB, Manish S N)
        4. multilingual_test_new_UTF-8.txt (9 kB, Manish S N)
        5. pom.xml (2 kB, Manish S N)
        6. tika_UTF-7_output.txt (9 kB, Manish S N)

            People

              Assignee: Unassigned
              Reporter: Manish S N (manish003)
              Votes: 0
              Watchers: 2
