  1. Tika
  2. TIKA-1005

Text inside a textbox in Microsoft Office Word 2010 documents is not extracted.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.3
    • Component/s: parser
    • Labels: None
    • Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 (32-bit and 64-bit each)

    Description

      Text inside a textbox is not extracted, whether the textbox sits in the body, the header, or the footer. This happens with every parser (including AutoDetectParser) in combination with every ContentHandler. This is NOT a duplicate of TIKA-904; it specifically concerns the .docx file format.
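To tell a parser gap apart from a damaged file, it can help to confirm independently of Tika that the textbox text is actually present in the .docx package. The following is a minimal diagnostic sketch using only the JDK (Java 9+), not the fix from the attached patch. It assumes that textbox content is stored in `w:txbxContent` elements of the WordprocessingML parts (`word/document.xml`, `word/headerN.xml`, `word/footerN.xml`). The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: lists the text found inside textboxes of a .docx file.
public class TextboxTextSketch {

    // Main WordprocessingML namespace (the "w:" prefix in document.xml).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one string per w:txbxContent element found in a single XML part. */
    static List<String> textboxText(InputStream xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // The parts of a .docx carry no DOCTYPE, so reject any that appears.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        List<String> result = new ArrayList<>();
        factory.newSAXParser().parse(xml, new DefaultHandler() {
            private final StringBuilder buf = new StringBuilder();
            private int depth = 0;          // > 0 while inside w:txbxContent
            private boolean inText = false; // true while inside w:t in a textbox

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (!W_NS.equals(uri)) return;
                if ("txbxContent".equals(local)) {
                    depth++;
                } else if ("t".equals(local) && depth > 0) {
                    inText = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) buf.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W_NS.equals(uri)) return;
                if ("t".equals(local)) {
                    inText = false;
                } else if ("p".equals(local) && depth > 0) {
                    buf.append('\n'); // keep paragraphs inside a textbox apart
                } else if ("txbxContent".equals(local) && --depth == 0) {
                    result.add(buf.toString().trim());
                    buf.setLength(0);
                }
            }
        });
        return result;
    }

    /** Scans the body, header, and footer parts of a .docx (a ZIP package). */
    static List<String> textboxTextInDocx(InputStream docx) throws Exception {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                if (e.getName().matches("word/(document|header\\d+|footer\\d+)\\.xml")) {
                    // Copy the entry first so the SAX parser cannot close the ZIP stream.
                    result.addAll(textboxText(new ByteArrayInputStream(zip.readAllBytes())));
                }
            }
        }
        return result;
    }
}
```

Calling `textboxTextInDocx` on a stream opened from the attached `Textbox example.docx` should list the textbox strings if the assumption above holds for that file. If those strings are then absent from Tika's output, the problem lies in the parser, not the document. A file may store the same textbox in more than one representation, so the list can contain duplicates.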

      Attachments

        1. Textbox example.docx
          25 kB
          David A. Patterson
        2. TIKA-1005.patch
          4 kB
          Michael McCandless

          People

            Assignee: Michael McCandless (mikemccand)
            Reporter: David A. Patterson (pattersonda01)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: