[TIKA-423] Parse docx and output to text file missing words

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7, 0.8, 0.9, 0.10
    • Fix Version/s: 1.1
    • Component/s: parser
    • Environment: Windows and Mac

    Description

      I created a Word document using Word 2007 on a Windows Server 2003 machine (via Remote Desktop); the same problem has also happened to someone else using Windows XP. The document contains person names, country names, addresses, and a date. Word tags some of these elements as "Smart Tags", and some of the words disappear from the text that Tika produces when parsing the document.

      So a text fragment like the one below in Word:
      Smart tags typically are names like George Washington, Marilyn Monroe, Napoleon Bonaparte, etc. But they are automatically generated by Word, so it can be difficult to control how they are

      Running Tika from the command line (on OS X) with java -jar /path/to/tika-app-0.7.jar -t /path/to/docx/document.docx > /path/to/output.txt results in something like:
      Smart tags typically are names like , , Napoleon Bonaparte, etc. But they are automatically generated by Word, so it can be difficult to control how they are

      Note the missing names George Washington and Marilyn Monroe. Marilyn Monroe was one of the names that Word had marked as a Smart Tag.

      While I've only tried this with Tika 0.7, my understanding is that it has been an issue since 0.3 at least.

      Removing all Smart tags from the document using Autocorrect options in Word will result in the correct output.
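
The symptom can be illustrated outside of Tika. In WordprocessingML, a Smart Tag wraps its runs in a w:smartTag element, so an extractor that reads only the runs directly under each paragraph would skip the tagged text. The sketch below assumes that this is the kind of traversal behind the missing words; it is not Tika's actual code, and the paragraph XML is hand-written for illustration rather than taken from tika_test.docx. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written paragraph (an assumption, not extracted from tika_test.docx):
# one name is wrapped in a w:smartTag element, the rest sits in plain runs.
PARAGRAPH = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">names like </w:t></w:r>
  <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="PersonName">
    <w:r><w:t>Marilyn Monroe</w:t></w:r>
  </w:smartTag>
  <w:r><w:t>, Napoleon Bonaparte, etc.</w:t></w:r>
</w:p>
"""


def text_direct_runs(p):
    """Collect text only from runs that are direct children of the paragraph."""
    return "".join(
        t.text or ""
        for r in p.findall("w:r", NS)
        for t in r.findall("w:t", NS)
    )


def text_all_runs(p):
    """Collect text from every w:t descendant, including runs nested in w:smartTag."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


paragraph = ET.fromstring(PARAGRAPH)
print(text_direct_runs(paragraph))  # the smart-tagged name is skipped
print(text_all_runs(paragraph))     # the smart-tagged name is kept
```

The first function reproduces the shape of the reported output (an empty slot before the comma), while the second, which descends into nested elements, keeps the name. This is also consistent with the workaround above: once the Smart Tags are removed, the runs are no longer nested and a shallow traversal sees them.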

      Attachments

        1. tika_test.docx
          12 kB
          David Tran
        2. output.txt
          1 kB
          David Tran

            People

              Assignee: Unassigned
              Reporter: David Tran (dtra)