TIKA-711

Word parser doesn't extract optional hyphen correctly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      We seem not to extract the optional hyphen character correctly in
      the Word parser.

      You can insert this character in Word by pressing Ctrl and - together.
      It's normally hidden; you have to turn on the display of formatting
      marks to see it.

      Ideally we'd get U+00AD (Unicode soft hyphen), I think.

      DOC produces a Unicode replacement character (U+FFFD), which is wrong.

      DOCX and PDF drop the character, which seems acceptable. RTF produces
      U+2027 (hyphenation point), which also seems OK (with TIKA-683 it will
      produce U+00AD).

      PPT and PPTX work correctly (U+00AD).

      So I think DOC is the only bug here; I haven't dug into what's wrong
      yet...
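
      The per-format outcomes listed above can be checked mechanically. The
      following is a minimal sketch, not Tika code: the class
      OptionalHyphenCheck, its Outcome enum and the sample strings are all
      hypothetical, and it assumes the extracted text is already available
      as a Java String (for example, from whatever parser call you use).

```java
// Hypothetical helper, not part of Tika: reports which of the outcomes
// described in this issue a piece of extracted text shows.
final class OptionalHyphenCheck {

    enum Outcome { SOFT_HYPHEN, HYPHENATION_POINT, REPLACEMENT_CHAR, DROPPED }

    static final char SOFT_HYPHEN = '\u00AD';        // U+00AD, the desired result
    static final char HYPHENATION_POINT = '\u2027';  // U+2027, what RTF produces
    static final char REPLACEMENT_CHAR = '\uFFFD';   // U+FFFD, the DOC bug

    // The replacement character is checked first because it is the failure
    // case; DROPPED means none of the three characters is present.
    static Outcome classify(String extracted) {
        if (extracted.indexOf(REPLACEMENT_CHAR) >= 0) {
            return Outcome.REPLACEMENT_CHAR;
        }
        if (extracted.indexOf(SOFT_HYPHEN) >= 0) {
            return Outcome.SOFT_HYPHEN;
        }
        if (extracted.indexOf(HYPHENATION_POINT) >= 0) {
            return Outcome.HYPHENATION_POINT;
        }
        return Outcome.DROPPED;
    }

    public static void main(String[] args) {
        // Made-up sample strings (assumption); these are NOT the contents
        // of the attached testOptionalHyphen.* files.
        String[] samples = {
            "optional" + SOFT_HYPHEN + "hyphen",
            "optional" + HYPHENATION_POINT + "hyphen",
            "optional" + REPLACEMENT_CHAR + "hyphen",
            "optionalhyphen"
        };
        for (String sample : samples) {
            System.out.println(classify(sample));
        }
    }
}
```

      A test for the DOC parser could then assert that the outcome is
      SOFT_HYPHEN rather than REPLACEMENT_CHAR, while DROPPED stays
      acceptable for DOCX and PDF.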

        Attachments

        1. TIKA-711.patch
          34 kB
          Michael McCandless
        2. TIKA-711.patch
          3 kB
          Michael McCandless
        3. testOptionalHyphen.rtf
          30 kB
          Michael McCandless
        4. testOptionalHyphen.pptx
          32 kB
          Michael McCandless
        5. testOptionalHyphen.ppt
          99 kB
          Michael McCandless
        6. testOptionalHyphen.pdf
          44 kB
          Michael McCandless
        7. testOptionalHyphen.docx
          10 kB
          Michael McCandless
        8. testOptionalHyphen.doc
          22 kB
          Michael McCandless

            People

            • Assignee:
              mikemccand Michael McCandless
              Reporter:
              mikemccand Michael McCandless