[TIKA-692] TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.10
Component/s: parser
Labels: None

Description

[Note: spinoff from the tika-dev thread "Issue in text extraction in
Solr / Tika" on Aug 19 2011, by nirnaydewan]

When parsing a Word doc where some contiguous text is bolded, due to
differences in how the user had bolded different parts of the text
with Word, TikaCLI -x or -h will sometimes generate output like this:

<p>F<b>oob</b>a<b>r</b>
</p>

and other times like this (extra newline & 2 adjacent bold sections):

<p>F<b>oo</b>
<b>b</b>a<b>r</b>
</p>

The extra newline in the second example causes browsers (I tried
Firefox, Safari, Chrome), JTidy and Tika itself to (incorrectly)
insert a space when rendering/extracting text, breaking up the word.
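
The effect can be reproduced outside Tika with a minimal sketch. It assumes the consumer collapses each run of whitespace into a single space, which is roughly what HTML rendering does. The names `TextCollector` and `visible_text` are hypothetical helpers written for this illustration; they are not part of Tika or JTidy.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the character data found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Whitespace-only text nodes, such as the "\n" after </b>, arrive here too.
        self.chunks.append(data)


def visible_text(markup):
    """Return the text of `markup` with whitespace runs collapsed to one space."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    joined = "".join(collector.chunks)
    # Assumption: the consumer normalizes whitespace the way HTML rendering does.
    return re.sub(r"\s+", " ", joined).strip()


# The two outputs quoted in the description.
first = "<p>F<b>oob</b>a<b>r</b>\n</p>"
second = "<p>F<b>oo</b>\n<b>b</b>a<b>r</b>\n</p>"

print(visible_text(first))
print(visible_text(second))
```

In the second input the newline sits between two inline elements inside the word, so after whitespace collapsing it survives as a space in the middle of the text.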

While this might be technically correct/OK (i.e., XML whitespace rules
might say that non-significant space after the </b> within a <p>
should be ignored), I think we should still fix Tika to not insert
newlines, if we can.

Attachments

- 0001-TIKA-692-TikaCLI-x-or-h-on-a-Word-doc-sometimes-adds.patch (20/Aug/11 16:28, 1 kB, Jukka Zitting)
- 0002-TIKA-692-TikaCLI-x-or-h-on-a-Word-doc-sometimes-adds.patch (20/Aug/11 18:25, 2 kB, Jukka Zitting)
- testWORD_bold_character_runs.doc (24/Aug/11 10:40, 22 kB, Michael McCandless)
- testWORD_bold_character_runs.doc (20/Aug/11 15:51, 22 kB, Michael McCandless)
- testWORD_bold_character_runs2.doc (20/Aug/11 15:51, 22 kB, Michael McCandless)
- testWORD_bold_character_runs2.docx (24/Aug/11 10:40, 10 kB, Michael McCandless)
- TIKA-692.patch (24/Aug/11 10:40, 16 kB, Michael McCandless)
- TIKA-692.patch (20/Aug/11 16:14, 10 kB, Michael McCandless)
- TIKA-692.patch (20/Aug/11 15:51, 7 kB, Michael McCandless)
- TIKA-692-pretty-print.patch (21/Aug/11 14:53, 5 kB, Michael McCandless)

People

Assignee: Jukka Zitting

Reporter: Michael McCandless

Votes: 0

Watchers: 0

Dates

Created: 20/Aug/11 15:37

Updated: 20/Oct/11 12:34

Resolved: 17/Sep/11 09:23