We don't seem to extract the optional hyphen character correctly in
the Word parser.
You can create this character in Word by pressing Ctrl and - together. It's
normally hidden; you have to turn on display of formatting marks to see it.
Ideally we'd get U+00AD (Unicode soft hyphen), I think.
DOC produces a Unicode replacement character (U+FFFD), which is wrong.
DOCX and PDF drop the char (which seems acceptable). RTF produces
U+2027 (hyphenation point) which also seems OK (in
TIKA-683 it will
PPT and PPTX work correctly (U+00AD).
So DOC is the only bug, I think; I haven't dug into what's wrong
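
For anyone re-checking the per-format behaviour described above, here is a minimal sketch of a helper that classifies what ended up in the extracted text. It only inspects a String that has already been extracted, so it assumes nothing about how the parsers are invoked. The class `OptionalHyphenCheck`, its `Outcome` enum, and both methods are hypothetical names made up for illustration; they are not part of Tika or of the attached patch.

```java
// Hypothetical helper, not part of Tika: classifies how an optional
// hyphen shows up in text that some parser has already extracted.
final class OptionalHyphenCheck {

    static final char SOFT_HYPHEN = '\u00AD';        // U+00AD, the ideal result
    static final char HYPHENATION_POINT = '\u2027';  // U+2027, what RTF produces
    static final char REPLACEMENT_CHAR = '\uFFFD';   // U+FFFD, what DOC produces

    enum Outcome { SOFT_HYPHEN, HYPHENATION_POINT, REPLACEMENT_CHAR, DROPPED }

    // The replacement character is checked first because it is the
    // failure case; DROPPED means none of the three characters is present.
    static Outcome classify(String extracted) {
        if (extracted.indexOf(REPLACEMENT_CHAR) >= 0) {
            return Outcome.REPLACEMENT_CHAR;
        }
        if (extracted.indexOf(SOFT_HYPHEN) >= 0) {
            return Outcome.SOFT_HYPHEN;
        }
        if (extracted.indexOf(HYPHENATION_POINT) >= 0) {
            return Outcome.HYPHENATION_POINT;
        }
        return Outcome.DROPPED;
    }

    // Following the assessment in this issue: everything except the
    // replacement character is treated as acceptable.
    static boolean isAcceptable(Outcome outcome) {
        return outcome != Outcome.REPLACEMENT_CHAR;
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for extracted text.
        String[] samples = {
            "optional\u00ADhyphen",
            "optional\u2027hyphen",
            "optional\uFFFDhyphen",
            "optionalhyphen"
        };
        for (String sample : samples) {
            Outcome outcome = classify(sample);
            System.out.println(outcome + " acceptable=" + isAcceptable(outcome));
        }
    }
}
```

In a real test the sample strings would be replaced by the text extracted from each of the attached testOptionalHyphen files.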
|Field||Original Value||New Value|
|Attachment||TIKA-711.patch [ 12493807 ]|
|Attachment||testOptionalHyphen.doc [ 12493808 ]|
|Attachment||testOptionalHyphen.docx [ 12493809 ]|
|Attachment||testOptionalHyphen.pdf [ 12493810 ]|
|Attachment||testOptionalHyphen.ppt [ 12493811 ]|
|Attachment||testOptionalHyphen.pptx [ 12493812 ]|
|Attachment||testOptionalHyphen.rtf [ 12493813 ]|
|Fix Version/s||0.10 [ 12313535 ]|
|Assignee||Michael McCandless [ mikemccand ]|
|Status||Open [ 1 ]||Resolved [ 5 ]|
|Fix Version/s||1.0 [ 12317967 ]|
|Resolution||Fixed [ 1 ]|
|Transition||Time In Source Status||Execution Times||Last Executer||Last Execution Date|
|24d 1h 45m||1||Michael McCandless||03/Oct/11 18:26|