Tika / TIKA-1130

.docx text extract leaves out some portions of text

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels:
      None
    • Environment:

      OpenJDK x86_64

      Description

      When parsing a Microsoft Word .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document), certain portions of text remain unextracted.

      I have attached a .docx file to test against. The gray portions of text are not extracted, while the darker text extracts fine.

      Looking at the document.xml portion of the .docx zip file shows the text is all there.
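
The check described above can be reproduced without Tika. What follows is a minimal sketch, not part of the original report: it opens the .docx as a zip archive, parses word/document.xml, and prints the text of every w:t run it finds. The class name DocxRawText and the method rawText are hypothetical names chosen for illustration, and the sketch assumes Java 9 or later for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; it does not use or mirror Tika's parser.
public class DocxRawText {
    // WordprocessingML main namespace, bound to the "w" prefix in document.xml.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of every w:t element in word/document.xml, one per line. */
    static String rawText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Buffer the entry so the XML parser cannot close the zip stream early.
                byte[] xml = zip.readAllBytes();
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml));
                // Matches w:t at any nesting depth, in document order.
                NodeList runs = doc.getElementsByTagNameNS(W_NS, "t");
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < runs.getLength(); i++) {
                    sb.append(runs.item(i).getTextContent()).append('\n');
                }
                return sb.toString();
            }
        }
        throw new IOException("word/document.xml not found");
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(rawText(in));
        }
    }
}
```

Running it against one of the attached files (for example `java DocxRawText.java OwenResume.docx`) and comparing the result with Tika's extracted text should show which runs are being dropped.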

        Attachments

        1. OwenResume.docx
          45 kB
          Daniel Gibby
        2. tee internal resme.docx
          39 kB
          Daniel Gibby
        3. TIKA-1130.patch
          11 kB
          Tim Allison
        4. TIKA-1130.patch
          11 kB
          Tim Allison
        5. Resume 6.4.13.docx
          125 kB
          Daniel Gibby

            People

            • Assignee: Unassigned
            • Reporter: Daniel Gibby (dangby)
            • Votes: 0
            • Watchers: 4
