Tika / TIKA-2435

docx parser missing content when multiple body sections


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.17
    • Component/s: None
    • Labels: None

      Description

      In https://bz.apache.org/bugzilla/show_bug.cgi?id=61354, [~kramachandran@commvault.com] reported that our DOM parser skips every "body" section after the first one in a docx file. PJ Fanning applied the patch, and the fix will be available in Tika when we upgrade to POI 3.17-beta2.

      As a side note, the experimental SAX parser correctly extracted all text from the triggering document.
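
One way to check whether a given docx file could hit this bug is to count the body elements in its word/document.xml part. The sketch below is only a diagnostic, written in Python with the standard library. It is not the POI patch, and it is not how Tika's DOM or SAX parser works. The helper names body_texts and make_sample_docx are hypothetical, and the sample package holds only word/document.xml, so it is enough for this check but is not a complete docx that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def body_texts(docx_bytes):
    """Return one string per w:body element, in document order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    texts = []
    for body in root.iter(f"{{{W_NS}}}body"):
        # Concatenate every w:t text run found under this body.
        runs = [t.text or "" for t in body.iter(f"{{{W_NS}}}t")]
        texts.append("".join(runs))
    return texts


def make_sample_docx(bodies):
    """Build a minimal in-memory package with one w:body per inner list.

    Hypothetical test helper: it writes only word/document.xml.
    """
    body_xml = []
    for paragraphs in bodies:
        paras = "".join(
            f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs
        )
        body_xml.append(f"<w:body>{paras}</w:body>")
    joined = "".join(body_xml)
    xml = f'<w:document xmlns:w="{W_NS}">{joined}</w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


if __name__ == "__main__":
    sample = make_sample_docx([["first"], ["second", "third"]])
    texts = body_texts(sample)
    # More than one entry means an extractor that stops after the
    # first body would lose the text in the remaining entries.
    print(len(texts), texts)
```

If body_texts returns more than one entry for a real file, comparing the later entries against the extracted text shows whether the content after the first body section was dropped.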

              People

              • Assignee: Unassigned
              • Reporter: Tim Allison (tallison)