[TIKA-2640] MS Word document checkboxes and dropdowns not fully converted to text - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.18
Fix Version/s: None
Component/s: core
Labels:
None
Environment:

MSWordDocWithCheckboxesAndDropdowns.doc

Description

When we use Tika to extract the text from a Microsoft Word (.doc) file that contains a checkbox, we get FORMCHECKBOX with no indication of whether it is checked.

When the doc has a dropdown menu, we get FORMDROPDOWN with no indication of which option was selected.
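The symptom can be pinned down in a regression test by scanning the extracted text for the raw field codes. The following is a minimal sketch, not part of Tika or POI: the class name `FormFieldPlaceholders` and the sample text in `main` are hypothetical, and the code assumes only that the placeholders appear literally as FORMCHECKBOX and FORMDROPDOWN, as described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of Tika or POI.
final class FormFieldPlaceholders {

    // Raw form field codes that this report says show up verbatim in the extracted text.
    private static final Pattern FIELD_CODE = Pattern.compile("FORM(CHECKBOX|DROPDOWN)");

    private FormFieldPlaceholders() {}

    /** Counts how often each raw field code occurs in the extracted text. */
    static Map<String, Integer> count(String extractedText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = FIELD_CODE.matcher(extractedText);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample shaped like the output described in this report.
        String sample = "Another kind of incident FORMCHECKBOX\n"
                + "Severity FORMDROPDOWN\n"
                + "Follow-up needed FORMCHECKBOX";
        System.out.println(count(sample));
    }
}
```

A test could assert that `count(text)` is empty once the extractor reports the field state instead of the bare field code.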

If we parse to XHTML instead, we still get, for example:

<tr> <td><p class="header">Another kind of incident</p>
</td> <td><p class="header"><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" />|_|</p>
</td> <td><p />
</td></tr>

even though the checkbox is ticked in the doc (checkboxes always show |_|).

Shouldn't the text reflect the checkbox as it does in the testCheckboxes() method in https://svn.apache.org/repos/asf/poi/trunk/src/ooxml/testcases/org/apache/poi/xwpf/extractor/TestXWPFWordExtractor.java (I realise this is POI but that is what Tika uses)?

Snippet:

XWPFDocument doc = XWPFTestDataSamples.openSampleDocument("checkboxes.docx");
XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
assertEquals("This is a small test for checkboxes \nunchecked: |_| \n" +
        "Or checked: |X|\n\n\n\n\n" +
        "Test a checkbox within a textbox: |_| -> |X|\n\n\n" +
        "In Table:\n|_|\t|X|\n\n\n" +
        "In Sequence:\n|X||_||X|\n",
        extractor.getText());
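To make the expected behaviour concrete, the `|_|` and `|X|` markers in the POI expectation can be counted with plain string matching. This is only a sketch under the assumption that the same marker convention would be used for .doc output; `CheckboxMarkers` is a hypothetical name, and the sample string in `main` is the expected text from the POI test quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of Tika or POI.
final class CheckboxMarkers {

    // Matches the text rendering of a checkbox: |_| (unchecked) or |X| (checked).
    private static final Pattern MARKER = Pattern.compile("\\|([_X])\\|");

    private CheckboxMarkers() {}

    static int countChecked(String text) {
        return count(text, 'X');
    }

    static int countUnchecked(String text) {
        return count(text, '_');
    }

    // Walks all markers and counts those whose inner character equals the given state.
    private static int count(String text, char state) {
        int n = 0;
        Matcher m = MARKER.matcher(text);
        while (m.find()) {
            if (m.group(1).charAt(0) == state) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Expected extractor output taken from the POI test quoted in this report.
        String expected = "This is a small test for checkboxes \nunchecked: |_| \n"
                + "Or checked: |X|\n\n\n\n\n"
                + "Test a checkbox within a textbox: |_| -> |X|\n\n\n"
                + "In Table:\n|_|\t|X|\n\n\n"
                + "In Sequence:\n|X||_||X|\n";
        System.out.println("checked=" + countChecked(expected)
                + " unchecked=" + countUnchecked(expected));
    }
}
```

With the behaviour reported here, a .doc containing a ticked checkbox would yield a checked count of zero, which is the discrepancy this issue describes.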

Our code:

InputStream stream = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + fileName);
String text = new Tika().parseToString(stream, new Metadata(), -1).trim();

I have attached an example MS Word doc file with checkboxes and a dropdown.

Regards and thanks, Pete

Attachments

MSWordDocWithCheckboxesAndDropdowns.doc
02/May/18 07:09
26 kB
Peter Davies

People

Assignee: Unassigned

Reporter: Peter Davies

Votes: 0

Watchers: 1

Dates

Created: 02/May/18 07:10

Updated: 03/May/18 08:11