TIKA-711

Word parser doesn't extract optional hyphen correctly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      We seem not to extract the optional hyphen character correctly in
      the Word parser.

      You can create this char in Word by pressing Ctrl+- (Ctrl and
      hyphen). It's normally hidden; you have to turn on display of
      formatting marks to see it.

      Ideally we'd get U+00AD (Unicode soft hyphen), I think.

      DOC produces a Unicode replacement char (U+FFFD), which is wrong.

      DOCX and PDF drop the char (which seems acceptable). RTF produces
      U+2027 (hyphenation point) which also seems OK (in TIKA-683 it will
      produce U+00AD).

      PPT and PPTX work correctly (U+00AD).

      So DOC is the only bug I think – I haven't dug into what's wrong
      yet...
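
      A minimal sketch of how the per-format results above could be
      checked, assuming the extracted text is already available as a
      String. The class name, method name, and sample strings are
      hypothetical; none of them come from Tika or from the attached
      patches and test files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reports which form the optional hyphen took in extracted text.
class OptionalHyphenCheck {
    static final char SOFT_HYPHEN = '\u00AD';        // the result this issue asks for
    static final char HYPHENATION_POINT = '\u2027';  // what RTF is reported to produce
    static final char REPLACEMENT_CHAR = '\uFFFD';   // what DOC is reported to produce (the bug)

    static String classify(String extracted) {
        // Check the replacement char first: its presence means the text is broken
        // regardless of what else it contains.
        if (extracted.indexOf(REPLACEMENT_CHAR) >= 0) {
            return "REPLACEMENT_CHAR";
        }
        if (extracted.indexOf(SOFT_HYPHEN) >= 0) {
            return "SOFT_HYPHEN";
        }
        if (extracted.indexOf(HYPHENATION_POINT) >= 0) {
            return "HYPHENATION_POINT";
        }
        return "DROPPED";
    }

    public static void main(String[] args) {
        // Made-up strings that mimic the behaviour described per format;
        // they are not the contents of the attached test documents.
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("DOC", "op\uFFFDtional");
        samples.put("DOCX", "optional");
        samples.put("RTF", "op\u2027tional");
        samples.put("PPT", "op\u00ADtional");
        for (Map.Entry<String, String> e : samples.entrySet()) {
            System.out.println(e.getKey() + ": " + classify(e.getValue()));
        }
    }
}
```

      Under the expectations stated above, only REPLACEMENT_CHAR would
      count as a failure; the other three outcomes are described as
      correct or acceptable.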

      1. testOptionalHyphen.doc
        22 kB
        Michael McCandless
      2. testOptionalHyphen.docx
        10 kB
        Michael McCandless
      3. testOptionalHyphen.pdf
        44 kB
        Michael McCandless
      4. testOptionalHyphen.ppt
        99 kB
        Michael McCandless
      5. testOptionalHyphen.pptx
        32 kB
        Michael McCandless
      6. testOptionalHyphen.rtf
        30 kB
        Michael McCandless
      7. TIKA-711.patch
        3 kB
        Michael McCandless
      8. TIKA-711.patch
        34 kB
        Michael McCandless

        Activity

        Michael McCandless created issue
        Michael McCandless made changes
        Field Original Value New Value
        Attachment TIKA-711.patch [ 12493807 ]
        Attachment testOptionalHyphen.doc [ 12493808 ]
        Attachment testOptionalHyphen.docx [ 12493809 ]
        Attachment testOptionalHyphen.pdf [ 12493810 ]
        Attachment testOptionalHyphen.ppt [ 12493811 ]
        Attachment testOptionalHyphen.pptx [ 12493812 ]
        Attachment testOptionalHyphen.rtf [ 12493813 ]
        Jukka Zitting made changes
        Fix Version/s 0.10 [ 12313535 ]
        Michael McCandless made changes
        Assignee Michael McCandless [ mikemccand ]
        Michael McCandless made changes
        Attachment TIKA-711.patch [ 12497399 ]
        Michael McCandless made changes
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 1.0 [ 12317967 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            0
