TIKA-1130: .docx text extract leaves out some portions of text

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels: None
    • Environment: OpenJDK x86_64

    Description

      When parsing a Microsoft Word .docx file (application/vnd.openxmlformats-officedocument.wordprocessingml.document), some portions of the text are not extracted.

      I have attached a .docx file to test against. The gray text is not extracted, while the darker text extracts fine.

      Inspecting the document.xml part inside the .docx zip archive shows that all of the text is present.
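
The check described above can be reproduced without Tika. The following is a minimal sketch using only the JDK, not the method used in the report or in the eventual fix: it opens the .docx as a zip archive, reads the main document part, and concatenates the text of every w:t element. The class name DocxRawText and the method rawText are hypothetical, and the sketch assumes the main part is stored at word/document.xml, as it is in files written by Word. Paragraph breaks are not preserved, so the output is only suitable for checking whether a phrase is present.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Tika.
public class DocxRawText {
    // WordprocessingML main namespace; w:t elements carry the run text.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of all w:t elements in word/document.xml, in document order. */
    public static String rawText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main document part lives at word/document.xml.
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Reads up to the end of the current zip entry (Java 9+).
                byte[] xml = zip.readAllBytes();

                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations, since the input is untrusted.
                factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
                Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));

                // Collect every w:t regardless of how deeply it is nested.
                NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
                StringBuilder out = new StringBuilder();
                for (int i = 0; i < texts.getLength(); i++) {
                    out.append(texts.item(i).getTextContent());
                }
                return out.toString();
            }
        }
        throw new IOException("word/document.xml not found in archive");
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxRawText <path-to-docx>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(rawText(in));
        }
    }
}
```

Running this against one of the attached files and comparing the result with the parser's output gives a quick way to confirm that a missing phrase is present in the XML and was dropped during extraction.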

      Attachments

        1. Resume 6.4.13.docx (125 kB, Daniel Gibby)
        2. TIKA-1130.patch (11 kB, Tim Allison)
        3. TIKA-1130.patch (11 kB, Tim Allison)
        4. tee internal resme.docx (39 kB, Daniel Gibby)
        5. OwenResume.docx (45 kB, Daniel Gibby)

          People

            Assignee: Unassigned
            Reporter: Daniel Gibby (dangby)
            Votes: 0
            Watchers: 4
