Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.18
Description
When we use Tika to parse the text from a Microsoft Word document (.doc) file that contains a checkbox, we get FORMCHECKBOX with no indication of whether it is checked.
When the doc has a dropdown menu, we get FORMDROPDOWN with no indication of which option was selected.
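The symptom described above can be detected mechanically in a regression test. The following is only a minimal sketch: the class name FieldCodeCheck and the sample string in main are hypothetical, and the check simply looks for the two raw field codes mentioned in this report rather than using any Tika or POI API.

```java
import java.util.ArrayList;
import java.util.List;

class FieldCodeCheck {
    // Raw field codes reported in this issue.
    private static final String[] FIELD_CODES = {"FORMCHECKBOX", "FORMDROPDOWN"};

    /** Returns the raw field codes that occur in the extracted text. */
    static List<String> unresolvedFieldCodes(String extractedText) {
        List<String> found = new ArrayList<>();
        for (String code : FIELD_CODES) {
            if (extractedText.contains(code)) {
                found.add(code);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical extracted text, not taken from the attached document.
        String text = "Another kind of incident FORMCHECKBOX\nSeverity FORMDROPDOWN";
        System.out.println(unresolvedFieldCodes(text));
    }
}
```

A test could pass the string returned by parseToString to unresolvedFieldCodes and fail when the result is not empty.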
If we parse to XHTML instead we still get e.g.
<tr> <td><p class="header">Another kind of incident</p> </td> <td><p class="header"><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" />|_|</p> </td> <td><p /> </td></tr>
even though the checkbox is ticked in the doc (checkboxes always show |_|).
Shouldn't the text reflect the checkbox as it does in the testCheckboxes() method in https://svn.apache.org/repos/asf/poi/trunk/src/ooxml/testcases/org/apache/poi/xwpf/extractor/TestXWPFWordExtractor.java (I realise this is POI but that is what Tika uses)?
Snippet:
XWPFDocument doc = XWPFTestDataSamples.openSampleDocument("checkboxes.docx");
XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
assertEquals("This is a small test for checkboxes \nunchecked: |_| \n"
        + "Or checked: |X|\n\n\n\n\n"
        + "Test a checkbox within a textbox: |_| -> |X|\n\n\n"
        + "In Table:\n|_|\t|X|\n\n\n"
        + "In Sequence:\n|X||_||X|\n",
        extractor.getText());
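To make the expected behaviour concrete, the |X| and |_| markers in the POI test string above can be counted. This is a rough sketch only: CheckboxMarkers is a hypothetical helper, and it assumes the extracted text uses exactly the |X| (checked) and |_| (unchecked) convention shown in that test.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CheckboxMarkers {
    // Matches one checkbox marker: |X| (checked) or |_| (unchecked).
    private static final Pattern MARKER = Pattern.compile("\\|([X_])\\|");

    /** Returns {checkedCount, uncheckedCount} for the given extracted text. */
    static int[] count(String text) {
        int checked = 0;
        int unchecked = 0;
        Matcher m = MARKER.matcher(text);
        while (m.find()) {
            if ("X".equals(m.group(1))) {
                checked++;
            } else {
                unchecked++;
            }
        }
        return new int[] {checked, unchecked};
    }

    public static void main(String[] args) {
        // Fragment of the expected string from the POI test quoted above.
        int[] counts = count("In Sequence:\n|X||_||X|\n");
        System.out.println("checked=" + counts[0] + ", unchecked=" + counts[1]);
    }
}
```

With the text we currently get from the .doc file, such a count would report every checkbox as unchecked, which is the behaviour this issue is about.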
Our code:
InputStream stream = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + fileName);
String text = new Tika().parseToString(stream, new Metadata(), -1).trim();
I have attached an example MS Word doc file with checkboxes and a dropdown.
Regards and thanks, Pete