Tika / TIKA-3008

Word Doc/Docx Formatting Extraction - Superscript/Subscript

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.23
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      Text extraction from Word .doc/.docx files does not handle superscript/subscript formatting at all.

      Because no tags are generated between them, character runs that differ only in sup/sub formatting are merged together, which changes the extracted text itself.

      This is especially problematic for some legal documents, where getting "according to Art 51" instead of "according to Art 5^1^" completely changes the meaning.
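To make the failure mode concrete, here is a minimal sketch of the merge problem. It does not use Tika code: the `(text, alignment)` run model and the function names `naive_text` and `marked_text` are assumptions made up for illustration, and the `<sup>`/`<sub>` output is only one possible way to keep the distinction.

```python
# Hypothetical run model: each run is (text, vertical_alignment),
# where vertical_alignment is None, "superscript" or "subscript".
RUNS = [("according to Art 5", None), ("1", "superscript")]

# Assumed mapping from alignment to an output tag name.
TAGS = {"superscript": "sup", "subscript": "sub"}


def naive_text(runs):
    """Concatenate run text and ignore formatting (the reported behaviour)."""
    return "".join(text for text, _ in runs)


def marked_text(runs):
    """Concatenate run text, wrapping sup/sub runs in a tag."""
    parts = []
    for text, alignment in runs:
        tag = TAGS.get(alignment)
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return "".join(parts)


print(naive_text(RUNS))   # formatting lost, digits merge into "Art 51"
print(marked_text(RUNS))  # superscript run stays distinguishable
```

With the naive join the two runs collapse into "Art 51"; with any marker between them the reference to "Art 5", footnote or index 1, survives extraction.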

       

      The problem seems to affect parsing of both the old Word .doc format and the OOXML .docx format.

      Sub/sup can be set directly on the character run or on the document style assigned to that character run.
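For the .docx side, the two places where sub/sup can live can be sketched roughly as follows. In WordprocessingML the setting is a `w:vertAlign` element inside `w:rPr`, either on the run itself or on a character style that the run references through `w:rStyle`. The XML fragments, the style id `FootnoteMark`, and the helper names below are invented for illustration; this is not the Tika or POI implementation, and it ignores style inheritance (`w:basedOn`) and paragraph styles for brevity.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w_attr(name):
    """Qualified attribute name in the WordprocessingML namespace."""
    return "{%s}%s" % (W, name)


# Invented sample data: one character style plus a paragraph of five runs.
STYLES_XML = f"""<w:styles xmlns:w="{W}">
  <w:style w:type="character" w:styleId="FootnoteMark">
    <w:rPr><w:vertAlign w:val="superscript"/></w:rPr>
  </w:style>
</w:styles>"""

PARAGRAPH_XML = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">according to Art 5</w:t></w:r>
  <w:r><w:rPr><w:rStyle w:val="FootnoteMark"/></w:rPr><w:t>1</w:t></w:r>
  <w:r><w:t xml:space="preserve"> and H</w:t></w:r>
  <w:r><w:rPr><w:vertAlign w:val="subscript"/></w:rPr><w:t>2</w:t></w:r>
  <w:r><w:t>O</w:t></w:r>
</w:p>"""


def style_alignments(styles_root):
    """Map styleId -> vertAlign value for styles that define one."""
    result = {}
    for style in styles_root.findall("w:style", NS):
        align = style.find("w:rPr/w:vertAlign", NS)
        if align is not None:
            result[style.get(w_attr("styleId"))] = align.get(w_attr("val"))
    return result


def effective_alignment(run, by_style):
    """Direct run formatting wins; otherwise fall back to the run's style."""
    direct = run.find("w:rPr/w:vertAlign", NS)
    if direct is not None:
        return direct.get(w_attr("val"))
    style_ref = run.find("w:rPr/w:rStyle", NS)
    if style_ref is not None:
        return by_style.get(style_ref.get(w_attr("val")))
    return None


def resolved_runs(paragraph, by_style):
    """Return (text, alignment) for every run in the paragraph."""
    runs = []
    for run in paragraph.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        runs.append((text, effective_alignment(run, by_style)))
    return runs


BY_STYLE = style_alignments(ET.fromstring(STYLES_XML))
RESOLVED = resolved_runs(ET.fromstring(PARAGRAPH_XML), BY_STYLE)
for text, alignment in RESOLVED:
    print(repr(text), alignment)
```

Checking the run first and the style second means a fix that inspects only the run properties would still miss the style-based case, which is why both are called out above.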

       

      I'm already working on fixes and test documents, and will comment with a work-in-progress branch.

            People

            • Assignee: Unassigned
            • Reporter: Cristian Vat (cristian.vat)
            • Votes: 0
            • Watchers: 1
