Tika / TIKA-711

Word parser doesn't extract optional hyphen correctly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels: None

    Description

      We seem not to extract the optional hyphen character correctly in
      the Word parser.

      You can create this character in Word by pressing Ctrl and - together.
      It's normally hidden; you have to turn on display of formatting marks
      to see it.

      Ideally we'd get U+00AD (Unicode soft hyphen), I think.

      DOC produces the Unicode replacement character (U+FFFD), which is wrong.

      DOCX and PDF drop the character (which seems acceptable). RTF produces
      U+2027 (hyphenation point), which also seems OK (in TIKA-683 it will
      produce U+00AD).

      PPT and PPTX work correctly (U+00AD).

      So DOC is the only bug, I think; I haven't dug into what's wrong
      yet.
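
      To make the per-format comparison above easier to check, here is a
      minimal sketch that counts the three code points mentioned in this
      report (U+00AD, U+2027, U+FFFD) in a string of already-extracted text.
      HyphenCheck and its methods are hypothetical names and not part of
      Tika; how the text is extracted from each attachment is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Tika): counts hyphen-related code points in extracted text. */
final class HyphenCheck {
    static final int SOFT_HYPHEN = 0x00AD;        // the result this report considers ideal
    static final int HYPHENATION_POINT = 0x2027;  // what RTF is reported to produce
    static final int REPLACEMENT_CHAR = 0xFFFD;   // what DOC is reported to produce (the bug)

    private HyphenCheck() {}

    /** Returns counts keyed by "U+XXXX" for the three code points above, in a fixed order. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int cp : new int[] {SOFT_HYPHEN, HYPHENATION_POINT, REPLACEMENT_CHAR}) {
            counts.put(String.format("U+%04X", cp), 0);
        }
        // Only keys that were pre-registered above are incremented; all other characters are ignored.
        text.codePoints().forEach(cp ->
                counts.computeIfPresent(String.format("U+%04X", cp), (key, n) -> n + 1));
        return counts;
    }

    /** True if the text contains no replacement characters, i.e. it does not show the DOC symptom. */
    static boolean hasNoReplacementChar(String text) {
        return count(text).get("U+FFFD") == 0;
    }
}
```

      Text from a parser that drops the character (as DOCX and PDF do)
      yields zero for all three keys, so this check alone cannot tell
      "dropped" apart from "no optional hyphen in the document".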

      Attachments

        1. TIKA-711.patch
          34 kB
          Michael McCandless
        2. testOptionalHyphen.doc
          22 kB
          Michael McCandless
        3. testOptionalHyphen.docx
          10 kB
          Michael McCandless
        4. testOptionalHyphen.pdf
          44 kB
          Michael McCandless
        5. testOptionalHyphen.ppt
          99 kB
          Michael McCandless
        6. testOptionalHyphen.pptx
          32 kB
          Michael McCandless
        7. testOptionalHyphen.rtf
          30 kB
          Michael McCandless
        8. TIKA-711.patch
          3 kB
          Michael McCandless

          People

            mikemccand Michael McCandless