TIKA-2640

MS Word document checkboxes and dropdowns not fully converted to text


Details

    Description

      When we use Tika to extract text from a Microsoft Word (.doc) file that contains a checkbox, we get FORMCHECKBOX with no indication of whether it is checked.

      When the document has a dropdown menu, we get FORMDROPDOWN with no indication of which option was selected.
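Until the field state is extracted, the affected output can at least be flagged. The following is a minimal sketch that only scans text that has already been extracted; it cannot recover the checked state or the selected option. The class and method names are hypothetical and are not part of Tika or POI.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or POI.
public class FormFieldPlaceholders {
    // The two field-code placeholders reported in this issue.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("FORMCHECKBOX|FORMDROPDOWN");

    /** Counts how often each placeholder appears in already-extracted text. */
    public static Map<String, Integer> count(String extractedText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("FORMCHECKBOX", 0);
        counts.put("FORMDROPDOWN", 0);
        Matcher m = PLACEHOLDER.matcher(extractedText);
        while (m.find()) {
            // Increment the counter for whichever placeholder matched.
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }
}
```

A non-zero count means the document contains form fields whose state is missing from the extracted text.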

      If we parse to XHTML instead, the checkbox state is still missing, e.g.:

       

      <tr> <td><p class="header">Another kind of incident</p>
      </td> <td><p class="header"><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" />|_|</p>
      </td> <td><p />
      </td></tr>
       
      

      even though the checkbox is ticked in the doc (checkboxes always show |_|).

      Shouldn't the text reflect the checkbox state, as it does in the testCheckboxes() method in https://svn.apache.org/repos/asf/poi/trunk/src/ooxml/testcases/org/apache/poi/xwpf/extractor/TestXWPFWordExtractor.java (I realise this is POI, but that is what Tika uses)?

      Snippet:

       

      XWPFDocument doc = XWPFTestDataSamples.openSampleDocument("checkboxes.docx");
      XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
      assertEquals("This is a small test for checkboxes \nunchecked: |_| \n"
              + "Or checked: |X|\n\n\n\n\n"
              + "Test a checkbox within a textbox: |_| -> |X|\n\n\n"
              + "In Table:\n|_|\t|X|\n\n\n"
              + "In Sequence:\n|X||_||X|\n",
              extractor.getText());
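If the extracted text used the |_| and |X| markers from the POI test above, the checkbox state could be read back with plain string matching. This is a small sketch under that assumption; the class and method names are hypothetical and nothing here calls Tika or POI.

```java
// Hypothetical helper, not part of Tika or POI.
public class CheckboxMarkers {
    /** Counts non-overlapping occurrences of marker in text. */
    static int count(String text, String marker) {
        int n = 0;
        for (int i = text.indexOf(marker); i >= 0;
                i = text.indexOf(marker, i + marker.length())) {
            n++;
        }
        return n;
    }

    /** Number of checked boxes, written as |X| in the POI test. */
    public static int checked(String text) {
        return count(text, "|X|");
    }

    /** Number of unchecked boxes, written as |_| in the POI test. */
    public static int unchecked(String text) {
        return count(text, "|_|");
    }
}
```

Applied to the expected string in the POI test, this counts five checked and four unchecked boxes.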
      

       

       

      Our code:

      InputStream stream = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + fileName);
      String text = new Tika().parseToString(stream, new Metadata(), -1).trim();
      

       

      I have attached an example MS Word doc file with checkboxes and a dropdown.

      Regards and thanks, Pete

       

       

          People

            Assignee: Unassigned
            Reporter: Peter Davies (pete_openanswers)